Preface



Privacy in statistical databases is about finding tradeoffs that resolve the tension
between the increasing societal and economic demand for accurate information and the
legal and ethical obligation to protect the privacy of the individuals and enterprises
that are the source of the statistical data. Statistical agencies cannot expect
to collect accurate information from individual or corporate respondents unless
these feel the privacy of their responses is guaranteed; also, recent surveys of
Web users show that a majority of them are unwilling to provide data to a Web
site unless they know that privacy protection measures are in place.

“Privacy in Statistical Databases 2004” (PSD 2004) was the final conference
of the CASC project (“Computational Aspects of Statistical Confidentiality”,
IST-2000-25069). PSD 2004 is in the style of the following conferences:
“Statistical Data Protection”, held in Lisbon in 1998 with proceedings published
by the Office of Official Publications of the EC, and the AMRADS project
SDC Workshop, held in Luxembourg in 2001 with proceedings published by
Springer-Verlag as LNCS Vol. 2316.

The Program Committee accepted 29 papers out of 44 submissions from 15 
different countries on four continents. Each submitted paper received at least two 
reviews. These proceedings contain the revised versions of the accepted papers. 
These papers cover the foundations and methods of tabular data protection, 
masking methods for the protection of individual data (microdata), synthetic 
data generation, disclosure risk analysis, and software/case studies. 

Many people deserve our gratitude. The conference and these proceedings 
would not have existed without the Organization Chair, Enric Ripoll, and the 
Organizing Committee (Jordi Castella, Antoni Martinez, Francesc Sebe and Julia 
Urrutia). In evaluating the papers submitted we received the help of the Program
Committee and four external reviewers (Jorg Hohne, Silvia Polettini, Yosef
Rinott and Giovanni Seri).

We also thank all the authors of submitted papers and apologize for possible 
omissions. 



March 2004 



Josep Domingo-Ferrer 
Vicenç Torra

Abstract. The paper introduces the methodology for disclosure limitation
offered by the software package τ-ARGUS. These methods have been applied
to the data sets of a library of close-to-real-life test instances. The paper
presents results of the tests, comparing the performance of the methods with
respect to key issues such as practical applicability, information loss, and
disclosure risk. Based on these results, the paper points out which of the alternative
methods offered by the package is likely to perform best in a given situation.



1 Introduction 

Data collected within government statistical systems are usually provided so as to fulfil
the requirements of many users, who differ widely in the particular interest they take in
the data. Data are published at several levels of detail in large tables, based on elaborate
hierarchical classification schemes. In many cases, cells of these tables contain
information on single, or very few, respondents. In the case of establishment data, given the
meta information provided along with the cell values (typically: industry, geography,
size classes), those respondents could be easily identifiable. Therefore, measures for
protection of those data have to be put in place. The choice is between suppressing
part of the information (cell suppression), or perturbing the data.

The software τ-ARGUS [13], which emerged from the European project CASC
(Computational Aspects of Statistical Confidentiality) [12], offers methods to
identify sensitive cells, a choice of algorithms to select secondary suppressions, programs
to compute interval bounds for suppressed cells (audit), and programs to generate synthetic
values to replace suppressed original ones in a publication. Section 2 introduces
the methods offered (or foreseen to be offered) by the package.

These methods have been applied to data sets of a library of close-to-real-life
test instances. Section 3 will present empirical results, comparing the performance of
the methods with respect to key issues concerning practical applicability, information
loss, and disclosure risk.

As a conclusion from the test results, section 4 will provide some guidelines for 
users, recommending specific methods to apply in certain situations. 

2 Methodological Background 

τ-ARGUS offers a variety of options for a disseminator to formulate protection
requirements, which will be discussed in section 2.1. When cell suppression is used as
the disclosure limitation technique, sensitive cells are suppressed in a first step
(‘primary suppressions’). In a second step, other cells (so-called ‘secondary’ or
‘complementary’ suppressions) must be suppressed along with the primary suppressions
in order to prevent users of the published table from recalculating the primary
suppressions. The problem of finding an optimum set of suppressions is known as the
‘secondary cell suppression problem’. τ-ARGUS offers a choice of algorithms to select
secondary suppressions, as outlined in section 2.2.

By solving a set of equations implied by the additive structure of a statistical table,
together with some additional constraints on cell values (such as non-negativity), it is
possible to obtain upper and lower bounds for the suppressed entries of a table. The package
can derive the bounds of these so-called ‘feasibility intervals’ (sec. 2.3). Based
on ideas of [5], a method for controlled tabular adjustment (CTA) has been implemented
to supply users with synthetic values located within those intervals, which
could be used to replace suppressed original values in a publication (sec. 2.4).



2.1 Formulation of Protection Requirements 

τ-ARGUS offers various options to formulate protection requirements. The software
can be used to prevent exact disclosure of respondent data only, or to also avoid
inferential disclosure to some degree.

When it is enough to prevent exact disclosure of respondent data, users of
τ-ARGUS specify the parameter n of a minimum frequency rule. In that case, secondary
suppressions are selected in such a way that the width of the feasibility
interval for any sensitive cell is non-zero, i.e. the interval does not consist of the
true cell value only. When it is not enough to prevent exact disclosure, but the risk of
approximate disclosure must also be limited, users of τ-ARGUS specify parameters of
the p%-rule or the dominance rule. The goal is then to find a set of secondary suppressions
ensuring that the resulting bounds of the feasibility interval of any sensitive cell
cannot be used to deduce bounds on an individual respondent contribution that are too
close according to the sensitivity criterion employed. Results of [3, 4] can be used to
compute the so-called ‘protection level’. The bounds of the feasibility interval must not
be closer to the true cell value than the protection level. Formulas corresponding to the
p%-rule and the (n,k)-dominance rule are given in table 3 of the appendix.
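
The protection level formulas of table 3 can be illustrated with a minimal Python sketch. The function names are hypothetical, and treating a cell as sensitive exactly when its protection level is positive is an assumption based on the usual statement of the p%-rule and the (n,k)-dominance rule; the text itself does not spell this out.

```python
def p_percent_protection_level(contributions, p):
    """Protection level under the p%-rule: (100 + p)/100 * x1 + x2 - x.

    x is the cell value, x1 and x2 the two largest contributions.
    """
    xs = sorted(contributions, reverse=True)
    x = sum(xs)
    x1 = xs[0] if xs else 0.0
    x2 = xs[1] if len(xs) > 1 else 0.0
    return (100 + p) / 100 * x1 + x2 - x


def dominance_protection_level(contributions, n, k):
    """Protection level under the (n,k)-dominance rule: 100/k * (x1 + ... + xn) - x."""
    xs = sorted(contributions, reverse=True)
    return 100 / k * sum(xs[:n]) - sum(xs)


def is_sensitive(protection_level):
    # Assumption: a positive protection level marks the cell as sensitive.
    return protection_level > 0


if __name__ == "__main__":
    cell = [80, 15, 5]  # hypothetical contributions, cell value 100
    print(p_percent_protection_level(cell, p=10))
    print(dominance_protection_level(cell, n=2, k=85))
```

For the hypothetical cell above, the second and third contributors together hold less than 10 % of the largest contribution's margin, so the p%-rule with p = 10 flags the cell and asks for a protection level of about 3 units.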

It should, however, be mentioned here that some problems have not yet been fully
solved in the current version of τ-ARGUS. With some of the secondary cell suppression
algorithms offered, it may happen that the algorithm considers a single-respondent
cell to be properly protected even though there is only one other suppression in
the same row/column/... of the table and this suppression is another single-respondent
cell. This may cause disclosure: the respondent contributing to either of the two
single-respondent cells will be able to recover the value of the other single respondent.
Similar problems of exact disclosure may arise with single-respondent cells which are not
in the same row/column/... of the table. Another problem, still unsolved for any
algorithm offered by the package, is that of assigning protection levels in such a
way that aggregates published implicitly (so-called ‘multi-cells’) in a protected table
will always be non-sensitive.
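
The single-respondent problem can be made concrete with a small worked example in Python. The row values and the function name are hypothetical; the sketch only shows the subtraction a respondent could carry out when the two suppressed cells of a row are both single-respondent cells.

```python
def recover_other_singleton(row_total, published_cells, own_value):
    """Value of the other suppressed cell, as seen by one of the two respondents.

    Assumes exactly two suppressed cells in the row, each with a single
    respondent, and a published row total.
    """
    return row_total - sum(published_cells) - own_value


if __name__ == "__main__":
    # Hypothetical row: total 100, one published cell (20),
    # two suppressed single-respondent cells (30 and 50).
    print(recover_other_singleton(100, [20], own_value=30))
```

The respondent behind the cell with value 30 knows their own contribution, so the other suppressed value follows exactly from the published figures.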

2.2 Algorithms for Secondary Cell Suppression

The goal of secondary cell suppression is to find a valid suppression pattern satisfying
the protection requirements of the sensitive cells (see 2.1 above), while minimizing
the loss of information associated with the suppressed entries. The ‘classical’
formulation of the secondary cell suppression problem is a combinatorial optimisation
problem, which is computationally extremely hard to solve. τ-ARGUS offers a variety
of algorithms to find a valid suppression pattern even for sets of large hierarchical
tables linked by linear interrelations. It is up to the user to trade off quality vs.
quantity, that is, to decide how many resources (computation time, costs for extra
software etc.) to spend in order to improve the quality of the output tables
with respect to information loss. The package basically offers a choice between four
different approaches:

OPTIMAL The Fischetti/Salazar methodology aims at the optimal solution of the
cell suppression problem [8]. A feasible solution is offered at an early stage of
processing, which is then optimised successively. It is up to the user to stop
execution before the optimal solution has been found, and accept the solution reached so
far. The user can also choose the objective of optimisation, i.e. choose between
different measures of information loss. Note that the method relies on high-performance,
commercial OR solvers.

MODULAR The HiTaS method [7] subdivides hierarchical tables into sets of
linked, unstructured tables. The cell suppression problem is solved for each subtable
using the Fischetti/Salazar methodology [8]. Backtracking of subtables avoids
consistency problems when cells belonging to more than one subtable are selected
as secondary suppressions.

NETWORK The concept of an algorithm based on network flow methodology
has been outlined in [1]. Castro’s algorithm aims at a heuristic solution of the CSP
for 2-dimensional tables. Network flow heuristics are known to be highly efficient.
It may thus turn out that the method is able to produce high-quality solutions
for large tables very quickly. τ-ARGUS offers an implementation applicable to
2-dimensional tables with hierarchical substructure in one dimension. A license for a
commercial OR solver will not be required to run the algorithm.

HYPERCUBE The hypercube algorithm GHMITER developed by R.D. Repsilber
(see [5, 6]) is a fast alternative to the above three OR-based methods. This heuristic
is able to provide a feasible solution even for extremely large, complex tables
without consuming many computer resources. The user, however, has to put up
with a certain tendency for over-suppression.

SINGLETON A special application of GHMITER, addressing only the protection of
single-respondent cells. The method is meant to be used as preprocessing for the
OPTIMAL and NETWORK methods, for which a solution for the problem with
single-respondent cells mentioned in sec. 2.1 has not yet been implemented.

With respect to the hypercube and the modular method, both involving backtracking
of subtables, it should be noted that such methods are not ‘global’. This causes a
certain disclosure risk (see [4] for problems related to non-global methods for
secondary cell suppression).
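
The idea of subdividing a hierarchical table into linked, unstructured subtables can be sketched in Python as follows. This is only an illustration of the decomposition step, assuming a hierarchy given as a parent-to-children mapping; it is not the HiTaS implementation, and the backtracking between subtables is not shown. All names are hypothetical.

```python
from collections import deque


def subtables(hierarchy, root):
    """Yield (total, components) for every non-leaf node, top level first.

    Each pair describes one unstructured subtable: the parent category acts
    as the total of its direct children.
    """
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = hierarchy.get(node, [])
        if children:
            yield node, list(children)
            queue.extend(children)


if __name__ == "__main__":
    # Hypothetical 3-level classification.
    hierarchy = {
        "Total": ["A", "B"],
        "A": ["A1", "A2"],
        "B": ["B1", "B2", "B3"],
    }
    for total, parts in subtables(hierarchy, "Total"):
        print(total, "=", " + ".join(parts))
```

Because a category such as "A" appears both as a component of the top subtable and as the total of its own subtable, a suppression selected in one subtable can affect another, which is why some form of backtracking is needed.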

2.3 Audit: Computing the Feasibility Intervals

By solving a set of equations implied by the additive structure of a statistical table,
together with some additional constraints on cell values (such as non-negativity), it is
possible to obtain upper and lower bounds for the suppressed entries of a table. These
bounds are solutions to the following linear programming problem (cf. [8]):

$$
\begin{aligned}
& \min\, y_i \quad \text{and} \quad \max\, y_i \quad \text{subject to} \\
& \sum_{i \in I} m_{ij}\, y_i = b_j, && j \in J \\
& lb_i \le y_i \le ub_i, && i \in P \cup S \\
& y_i = a_i, && i \notin P \cup S
\end{aligned}
$$

where the additive structure of the table is given by the set of linear equations
$\sum_{i \in I} m_{ij}\, a_i = b_j$, $j \in J$ (typically $b_j = 0$ and
$m_{ij} \in \{-1, 0, 1\}$). $I$, $P$, and $S$ denote the set of all cells, of the
sensitive cells, and of the secondary suppressions, respectively, and $ub_i$, $lb_i$
are constraints on the cell values $a_i$. τ-ARGUS assumes
$ub_i - a_i = a_i - lb_i = q \cdot a_i$. By default, the parameter $q$ is set to 1.
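
For a single additive relation (one row with a published total), the feasibility intervals can be written down in closed form without an LP solver. The following Python sketch covers only that special case, under the symmetric bounds with parameter q stated above; a real table with several linked relations requires solving the LP. The function name and example values are hypothetical.

```python
def feasibility_intervals(suppressed, remainder, q=1.0):
    """Feasibility intervals for the suppressed cells of ONE additive relation.

    suppressed: true values a_i of the suppressed cells (known to the auditor)
    remainder:  published total minus all published cells of the relation
    q:          relative width of the a priori bounds, lb_i = a_i - q*a_i
                and ub_i = a_i + q*a_i
    """
    lb = [a - q * a for a in suppressed]
    ub = [a + q * a for a in suppressed]
    intervals = []
    for k in range(len(suppressed)):
        others_lb = sum(lb) - lb[k]
        others_ub = sum(ub) - ub[k]
        # Cell k is lowest when all other suppressed cells are at their
        # upper bounds, and highest when they are at their lower bounds.
        low = max(lb[k], remainder - others_ub)
        high = min(ub[k], remainder - others_lb)
        intervals.append((low, high))
    return intervals


if __name__ == "__main__":
    # Two suppressed cells (30 and 50) in a row whose suppressed part sums to 80.
    print(feasibility_intervals([30, 50], remainder=80))
    # A single suppressed cell is disclosed exactly: the interval has width zero.
    print(feasibility_intervals([30], remainder=30))
```

A suppression pattern protects a sensitive cell only if the interval obtained this way is wide enough around the true value; an interval of width zero corresponds to exact disclosure.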



2.4 Controlled Tabular Adjustment for Tables with Suppressions 



The authors of [5] suggest controlled tabular adjustment (CTA) to compute synthetic
values which could be used to replace suppressed original ones in a publication. The
idea of CTA is to determine synthetic values that are ‘as close as possible’¹ to the
original ones for the non-sensitive cells, but at some ‘safe’ distance for the sensitive
cells.

In the following, we consider a variant of this approach. As the idea of CTA is still
fairly new, and the method is not yet established as a standard for tabular data
protection, we thought that a natural way of familiarizing those who are used to tables
protected by cell suppression with the new methodology would be to present it as ‘just
releasing some additional information’ on the suppressed entries. Therefore, while with
the CTA methods suggested so far (see [5, 6, 2, 11]) all cells are candidates for
adjustment, in our variant adjustment is restricted to the suppressed cells of a protected
table. Synthetic values are then obtained as the solution to the following LP problem:

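$$
\min \; \sum_{i \in P^+} w_i\, y_i^+ \;+\; \sum_{i \in P^-} w_i\, y_i^- \;+\;
\sum_{i \in S} w_i\, (y_i^+ + y_i^-) \;+\;
W \Big( \sum_{i \in P^-} y_i^+ + \sum_{i \in P^+} y_i^- \Big)
$$

subject to:

$$
\begin{aligned}
& \sum_{i \in P \cup S} m_{ij}\, (y_i^+ - y_i^-) = b_j - \sum_{i \in P \cup S} m_{ij}\, a_i, && j \in J \\
& 0 \le y_i^+ \le \max\nolimits_i - a_i, && i \in P \cup S \\
& 0 \le y_i^- \le a_i - \min\nolimits_i, && i \in P \cup S \\
& y_i^+ \ge upl_i, && i \in P^+ \\
& y_i^- \ge lpl_i, && i \in P^-
\end{aligned}
$$

¹ Using an L1 distance.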

where $\max_i$ and $\min_i$ denote the solutions to the LP problem of section 2.3,
$w_i$ are weights obtained by a cost function such as $w_i = \frac{1}{1 + a_i}$,
$W$ is a very large constant, and $upl_i$ and $lpl_i$ are upper and lower protection
levels for the sensitive cells computed according to section 2.1. Synthetic values for
$i \in P \cup S$ are then defined as $a_i + y_i^+ - y_i^-$.

$P^+$ and $P^-$, the sets of sensitive cells which are adjusted to (or beyond) their
upper ($P^+$) or lower ($P^-$) protection level, are determined in advance of solving the
LP problem by a simple heuristic as outlined in [5]: considering protection level and
cell size, cells on the lowest level of the table are allocated to $P^+$ and $P^-$ in
alternating sequence. For the allocation of higher-level sensitive cells, we consider the
allocation of the corresponding lower-level sensitive cells. Unfortunately, our simple
heuristic tends to give solutions where both $y_i^+$ and $y_i^-$ are positive for a few of
the sensitive cells, i.e. where the synthetic cell value will be too close to the true
cell value. For a discussion of this problem, and a suggestion how to solve it, see [2].
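
The alternating allocation of lowest-level sensitive cells to the two direction sets can be sketched in Python. The ordering used here (descending cell value) is a simplifying assumption, since the text only says that protection level and cell size are considered; the treatment of higher-level cells and the LP itself are not shown. All names are hypothetical.

```python
def allocate_directions(cell_values):
    """Split lowest-level sensitive cells into (P_plus, P_minus) alternately.

    cell_values: dict mapping a cell id to its value a_i.
    Assumption: cells are visited in order of descending cell value.
    """
    ordered = sorted(cell_values, key=cell_values.get, reverse=True)
    p_plus = ordered[0::2]   # adjusted upwards, y_i^+ >= upl_i
    p_minus = ordered[1::2]  # adjusted downwards, y_i^- >= lpl_i
    return p_plus, p_minus


def minimal_safe_values(cell_values, upl, lpl, p_plus, p_minus):
    """Smallest adjustments that satisfy the protection constraints alone.

    Additivity of the table is ignored here; in the LP it is what forces
    further adjustments of the other suppressed cells.
    """
    targets = {}
    for cell in p_plus:
        targets[cell] = cell_values[cell] + upl[cell]
    for cell in p_minus:
        targets[cell] = cell_values[cell] - lpl[cell]
    return targets


if __name__ == "__main__":
    values = {"c1": 120, "c2": 80, "c3": 45}   # hypothetical sensitive cells
    levels = {"c1": 6, "c2": 4, "c3": 3}       # hypothetical protection levels
    plus, minus = allocate_directions(values)
    print(plus, minus)
    print(minimal_safe_values(values, levels, levels, plus, minus))
```

Alternating the direction is meant to let upward and downward adjustments offset each other within a table relation, so that less adjustment has to be absorbed by the secondary suppressions.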



3 Application 

The methods described in section 2 have been applied to data sets of a library of
close-to-real-life test instances. Section 3.1 explains the test scenario. Section 3.2
compares the performance of the alternative cell suppression algorithms. Results of
further processing (audit and CTA) are presented in section 3.3.



3.1 Data Sets 

A synthetic data file has been constructed based on typical real-life structural business
data. The algorithm used for generating the synthetic data has been designed so as to
preserve those properties of typical tabulations of the data that are relevant for cell
suppression, i.e. structures of variables, location of sensitive and zero cells, cell
sensitivity, and number of contributions for low-frequency cells. The file consists of
nearly 3 million records. It offers three categorical (i.e. explanatory) variables, and a
variety of response variables, one of which was chosen for the applications². Of the
categorical variables, one offers a (7-level) hierarchical structure. For some of the
tabulations, only one of the non-hierarchical variables was considered. The depth of the
hierarchical variable was varied. In this way six tables were generated, three
2-dimensional and three 3-dimensional ones, with size (i.e. number of cells) varying
between 460 and about 150 000 cells. A p%-rule was employed for primary suppression. See
table 4 in the appendix for details.

² Due to technical problems with the current version of τ-ARGUS, the last 5 digits of the
variable, which had up to 15 digits, were dropped in advance of tabulation.




6 Sarah Giessing 



Except for the network flow method, which was not yet available for application to
hierarchical tables, all the algorithms listed in section 2.2 have been applied to the six
tables described above. Unfortunately, for the largest table (table 6) only the hypercube
method terminated properly. A run with the modular method could not be completed, and runs
with the optimal method were not even attempted because of the expected excessive
CPU usage (several days). Using CPLEX 7.5 as OR solver, runs were carried out on a
Windows NT PC with an Intel Pentium III processor and 261 MB RAM.

While for the 2-dimensional tables processing times were short enough to be of no
concern, for the 3-dimensional tables, application of the Linear Programming based
methods took considerably more time than a run of the hypercube method. With
increasing depth of hierarchical structure, the effectiveness of the modular implementation
(compared to ‘Optimal’) regarding reduction of execution time grows: for the 4-level
table 5, execution time is reduced by about 9 hours (from 12h10 to 3h11), while for
the 3-level table 4 the reduction is only 21 minutes (from 1h33 to 1h12). See table 5
(appendix) for details.
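
The reductions quoted above follow directly from the reported execution times. A short Python sketch of the arithmetic (the helper name is hypothetical):

```python
def to_minutes(hmm):
    """Convert an 'h:mm' time string to minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


def saving(optimal, modular):
    """Absolute (minutes) and relative (%) reduction of 'Modular' vs 'Optimal'."""
    diff = to_minutes(optimal) - to_minutes(modular)
    return diff, 100 * diff / to_minutes(optimal)


if __name__ == "__main__":
    for table, opt, mod in [(4, "1:33", "1:12"), (5, "12:10", "3:11")]:
        minutes, percent = saving(opt, mod)
        print(f"table {table}: {minutes} min saved ({percent:.0f} %)")
```

For table 5 the saving is 539 minutes, i.e. just under 9 hours or roughly three quarters of the ‘Optimal’ run time, against 21 minutes (under a quarter) for table 4.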

As mentioned in 2.2, with method ‘Optimal’ it is up to the user to stop execution 
before the optimal solution has been found, and accept the solution reached so far. It 
should be noted that we actually made use of this option. Thus, not all the suppression 
patterns generated by this method can be considered truly ‘optimal’. 



3.2 Results of Secondary Cell Suppression 

This section compares performances of the algorithms on the test tables with respect 
to number and added values of the secondary suppressions. Concerning the LP-based 
methods (Mod, Opt, Si/Opt), results presented in table 1 below were obtained when 
using the response variable as cost function. 

Table 1. Information Loss due to Secondary Suppression.

                            No Suppressions (%)               Added Value of Suppressions (%)
Table  Hier.    No Cells    Hyp     Mod     Opt     Si/Opt    Hyp     Mod     Opt     Si/Opt
       Levels

2-dimensional tables
1      3        460         6.96    4.35    4.78    7.61      0.18    0.05    0.03    0.05
2      4        1050        10.95   8.29    7.43    11.52     0.98    0.62    0.58    0.71
3      6        8230        14.92   11.48   15.36   17.97     6.78    1.51    1.64    2.06

3-dimensional tables
4      3        8280        14.63   10.72   14.96   16.44     6.92    1.41    0.63    1.58
5      4        18900       17.31   15.41   19.19   20.00     12.57   3.55    2.32    4.48
6      6        148140      15.99   -       -       -         23.16   -       -       -

With respect to the number of secondary suppressions, method ‘Modular’ performed
best on all tables except for table 2, where method ‘Optimal’ suppressed 9
cells fewer. Except for the 2 smallest tables, ‘Optimal’ with cell value as cost function
performs even worse than the hypercube method.
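
The comparison can be retraced from the percentages of Table 1. The Python sketch below copies the ‘No Suppressions (%)’ columns for tables 1 to 5 and converts them to approximate cell counts; the variable and function names are hypothetical, and the counts are rounded because the table only reports percentages.

```python
# Share of cells selected as secondary suppressions (%), from Table 1.
TABLE_1 = {
    1: {"cells": 460,   "Hyp": 6.96,  "Mod": 4.35,  "Opt": 4.78},
    2: {"cells": 1050,  "Hyp": 10.95, "Mod": 8.29,  "Opt": 7.43},
    3: {"cells": 8230,  "Hyp": 14.92, "Mod": 11.48, "Opt": 15.36},
    4: {"cells": 8280,  "Hyp": 14.63, "Mod": 10.72, "Opt": 14.96},
    5: {"cells": 18900, "Hyp": 17.31, "Mod": 15.41, "Opt": 19.19},
}


def cell_count(table, method):
    """Approximate number of secondary suppressions, rounded to whole cells."""
    row = TABLE_1[table]
    return round(row[method] * row["cells"] / 100)


def fewest_suppressions(table):
    """Method with the smallest share of secondary suppressions."""
    row = TABLE_1[table]
    return min(("Hyp", "Mod", "Opt"), key=row.get)


if __name__ == "__main__":
    print(cell_count(2, "Mod") - cell_count(2, "Opt"))
    print([fewest_suppressions(t) for t in sorted(TABLE_1)])
```

The difference of 0.86 percentage points on the 1050 cells of table 2 corresponds to the 9 cells mentioned in the text.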

With respect to the added value of the secondary suppressions, method ‘Optimal’
gave the best results, with the exception of table 3, where the added value suppressed by
‘Modular’ was only 92 % of that of the ‘Optimal’ suppression pattern (1.51 % vs.
1.64 %). This might be explained firstly by the fact that we stopped execution of
‘Optimal’ before the optimal solution was found, and secondly by an audit of the protected
tables (see section 3.3), which proved that for this instance the suppression pattern
computed by the modular method does not satisfy the protection requirements (see the
remarks at the bottom of section 2.2 on non-globality of the method), and thus had to be
rejected by the ‘Optimal’ method.

Because the disclosure risk problem with single-respondent cells mentioned at the
bottom of sec. 2.1 has not yet been solved for method ‘Optimal’³, it should be used in
practice only in combination with ‘Singleton’ preprocessing. With respect to the number
of suppressions, however, the combination ‘Singleton/Optimal’ seems to perform
poorly. But the picture changes a little when we look at the added value of the
suppressions: while method ‘Modular’ (and of course ‘Optimal’ as well) outperforms the
combined method in that respect as well, we find that the suppressions selected are
much smaller than those obtained by ‘Hypercube’.

Overall, the differences observed in performance are larger for the criterion ‘added
value of suppressions’ than for ‘number of suppressions’. However, in our experience
‘number of suppressions’ is a quality criterion that matters a lot to statisticians,
especially the number of suppressions on the higher levels of a table. Table 2 below
presents results for two selected combined levels, I and II, of the tables. Level I rows
summarize results for the top-level (‘total’) cells of the non-hierarchical variable(s);
level II rows refer to those cells on the two top levels of the hierarchical variable
which are ‘inner’ cells with respect to the non-hierarchical variable for the
2-dimensional tables, and with respect to one (and only one) of the non-hierarchical
variables for the 3-dimensional tables. Suppressions on these high levels of a table are
usually considered quite undesirable.

Rows with zero entries for all four methods were dropped from table 2. 

Table 2. Number of secondary suppressions for selected combined levels (in %).

Table   Combined   Hypercube   Modular   Optimal   Singleton/
        Level                                      Optimal

2-dimensional tables
1, 2    II         7.41        4.44      4.44      7.41
3       I          5.83        2.19      2.67      2.92
3       II         12.59       6.67      6.67      17.04

3-dimensional tables
4       II         12.05       5.13      9.49      11.03
5       I          14.29       0.00      0.00      3.81
5       II         15.64       8.46      9.74      8.72


³ For lack of suitable software we did not systematically check the results of applications
of method ‘Optimal’ for the instances of section 3. However, even for the smallest of the
test tables (table 1) it was easy to find a case where the respondents to two of the
single-respondent cells would have been able to disclose each other's contribution exactly
by an analysis of the table protected by ‘Optimal’.

Table 2 confirms once more the superior performance of the optimization methods
over the hypercube method. It also shows that, although the combined
Singleton/Optimal method tends to suppress more cells overall than the hypercube method,
on the top levels of the tables it behaves better.

In our experience, both criteria, the number as well as the size of suppressed cells,
matter to statisticians. In comparison to ‘Optimal’, results of method ‘Modular’ tend to
offer a better compromise between these two criteria. Presuming that another cost function
for ‘Optimal’ might lead to a better compromise, we tested three alternative cost
functions: number of suppressions (No), $\sqrt{\text{cell value}} + \text{constant}$
(c1)⁴, and $\sqrt{\text{cell value}}$ (c2). Results are available, however, only for
table 3. For table 3, we also obtained results from two different applications of the
Singleton/Optimal method (cost function: cell value), one stopped after 1 hour, the other
one after 1½ hours of execution time.
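
The four cost functions compared for ‘Optimal’ can be written down in a few lines of Python. This is a sketch under two assumptions: that c1 adds the constant to the square root of the cell value, and that the constant is taken as a crude estimate of the 99.5 % quantile of the square roots of the cell values, in line with the heuristic choice described for it. The function names are hypothetical.

```python
import math


def make_cost_functions(cell_values):
    """Return the cost functions 'value', 'No', 'c1' and 'c2' for one table."""
    roots = sorted(math.sqrt(v) for v in cell_values)
    # Crude quantile estimate: element at 99.5 % of the sorted positions.
    index = min(len(roots) - 1, int(0.995 * len(roots)))
    constant = roots[index]
    return {
        "value": lambda v: v,                      # cell value (default)
        "No": lambda v: 1.0,                       # every suppression costs the same
        "c1": lambda v: math.sqrt(v) + constant,   # damped size plus large constant
        "c2": lambda v: math.sqrt(v),              # damped size only
    }


if __name__ == "__main__":
    costs = make_cost_functions([1, 4, 9, 16, 10000])  # hypothetical cell values
    for name, cost in costs.items():
        print(name, cost(16))
```

With a large constant, c1 lies between the two extremes: the cost of a suppression is dominated by the constant, as with ‘No’, but larger cells remain somewhat more expensive than small ones.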

Table 6 (appendix) presents the results. In this experiment, using ‘number of
suppressions’ as cost function resulted in a behavior similar to that of the hypercube
method, with many suppressions on the top level (I). Cost functions c1 and c2 lead to
fewer suppressions on this level. Cost function c2, however, results in poor performance
regarding the number of suppressions on the lower levels of the table. The modular
method clearly offers a better compromise between the criteria ‘number of suppressions’
and ‘added value of suppressions’.

Comparison of the performance of method ‘Optimal’ for different execution times
(after preprocessing with ‘Singleton’), presented in the last two columns of table 6,
indicates that spending more time on the optimization indeed improves results,
although the effect of changing the cost function was stronger.

3.3 Further Processing of Tables Protected by Cell Suppression

Using an implementation of the linear problem of section 2.3 delivered by University
Ilmenau [14], we obtained audit results for the protected tables of the test set. In none
of the tables did we find insufficient protection in a suppression pattern computed by
‘Optimal’ or ‘Singleton/Optimal’. This gives some empirical proof of the mathematical
reliability and correct implementation of the algorithm ‘Optimal’, which also
underlies the ‘Modular’ method.

In section 2.2, we mentioned non-globality of the methods ‘Modular’ and ‘Hypercube’
as a possible source of disclosure risk. Table 7 (appendix) gives empirical confirmation
of this problem: while some of the tables had been protected sufficiently, for others we
actually found primary suppressions lacking protection. In some cases, the problem
affected between 5 and 6 percent of the primary suppressions.

University Ilmenau also delivered a straightforward implementation⁵ of the CTA
method for tables with suppressions outlined in section 2.4 [14], using CPLEX as
OR solver. This algorithm yields synthetic values for suppressed entries located
within the feasibility intervals obtained as solutions to the LP problem of section 2.3.
The bounds of the feasibility intervals depend strongly on the constraints $ub_i$, $lb_i$
on the cell values $a_i$ which are supplied by the user. If these constraints are close to
the cell values, feasibility intervals will be small, forcing in turn synthetic values close
to the original cell values. Naturally, the constraints used to compute the feasibility
intervals must not be closer than those assumed for the secondary cell suppression
problem, because otherwise it is very likely that for some of the primary suppressions
the bounds of the feasibility interval will be closer than the protection level. On the
other hand, narrowing the intervals given by the constraints for the secondary
cell suppression problem increases the number of suppressions: when we, for
instance, used symmetric bounds $ub_i - a_i = a_i - lb_i = q \cdot a_i$ and varied q at
1, 0.5, 0.15, and 0.08 for application of method ‘Hypercube’ to our test table 4, we
obtained suppression patterns with 1211, 1342, 1509 and 1639 secondary suppressions,
respectively.

Close constraints should thus only be used when the goal is to provide synthetic
values.

⁴ The constant was chosen heuristically (a large value, close to the 99.5% quantile of the
cell value square roots).

⁵ The source code of the implementation will be made available on the CASC web site.
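
The effect of narrowing the a priori bounds can be put into perspective with a little arithmetic. The Python sketch below uses the suppression counts reported for method ‘Hypercube’ on test table 4 and the 8280 cells of that table; the example cell value of 1000 and all names are hypothetical.

```python
CELLS_TABLE_4 = 8280  # number of cells of test table 4

# Secondary suppressions obtained with 'Hypercube' for each value of q.
SUPPRESSIONS = {1.0: 1211, 0.5: 1342, 0.15: 1509, 0.08: 1639}


def bounds(a, q):
    """Symmetric a priori bounds (lb, ub) with ub - a = a - lb = q * a."""
    return a - q * a, a + q * a


def share(q):
    """Secondary suppressions as a percentage of all cells of table 4."""
    return 100 * SUPPRESSIONS[q] / CELLS_TABLE_4


if __name__ == "__main__":
    for q in SUPPRESSIONS:
        lb, ub = bounds(1000, q)
        print(f"q={q}: bounds [{lb:.0f}, {ub:.0f}], {share(q):.2f} % suppressed")
```

Going from q = 1 to q = 0.08 raises the share of secondary suppressions from about 14.6 % to about 19.8 % of the cells, i.e. roughly a third more suppressions.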

Table 8 (appendix) presents the distribution of synthetic values by percent deviation
from the true cell values. For generation of the synthetic values, feasibility intervals
with q = 0.09 had been computed for the ‘Hypercube’ suppression pattern of test table
4 obtained with q = 0.08. For more than 80 % of all non-zero cells, synthetic values
deviate by less than 2 %. While the deviation for primary suppressions is mostly between
2 % and 10 % (more than 80 % of cases), most of the secondary suppressions (ca.
79 %) were changed by less than 2 %. Deviation exceeding 10 % was an exception
observed only at very small cell values below 20 (note that the mean cell value in
table 4 is 1,124,740).

With respect to disclosure risk, as mentioned in sec. 2.4, the heuristic tends to
provide solutions where a few of the synthetic values are too close to the original ones.
This mainly affects primary suppressions with very small protection levels (less than
or equal to 2), probably due to problems of numerical precision which might be fixed
easily. In our experiments, we observed that the set of those primary suppressions
where, after secondary suppression, protection lacks by at least 2 units (3 cells for the
above instance) is a subset of those cells where synthetic values are too close by at
least 2 units. For the above instance both sets were identical, that is, 3 of 854 (i.e.
0.35 %) synthetic values for primary suppressions are too close by at least 2 units.



4 Summary and Conclusions 

The software τ-ARGUS offers methods to identify sensitive cells, a choice of
algorithms to select secondary suppressions, programs to compute interval bounds for
suppressed cells (audit), and programs to generate synthetic values to replace suppressed
original ones in a publication. The paper has compared the performance of the secondary
cell suppression algorithms on a set of small to moderate sized 2- and 3-dimensional
hierarchical tables considering information loss (number and added value of secondary
suppressions at certain hierarchical levels of the tables), and disclosure risk.

Results of this study show that, in a situation where a user is interested in obtaining
a suppression pattern for a single table with rather few, rather small secondary
suppressions, preferably on the lower levels of the table, the best choice by far is to
use the method ‘Modular’. For medium sized, 3-dimensional tables, long CPU times
(compared to the hypercube method) are a nuisance, but the quality of the results clearly
justifies the additional computational effort.

Results obtained by method ‘Optimal’, on the other hand, were less convincing:
firstly, the disclosure risk problem for single-contributor cells is not solved in the
current implementation. Secondly, results depend strongly on the particular cost
function employed. If the cost function does not fully reflect the user's idea of a good
suppression pattern, performance of method ‘Optimal’ will not be worth the additional
computational effort (compared to method ‘Modular’), which is quite considerable for
3-dimensional tables with elaborate hierarchical structure.

However, audit results prove that users of the modular and the hypercube method
face some disclosure risk: in our tables, we found up to 6 percent of primary
suppressions where protection lacked. If a data provider cannot put up with this, it could
be an option to use method ‘Optimal’ for post-processing: after auditing the suppression
pattern obtained from either of the heuristic methods, method ‘Optimal’ could be
applied to the resulting table, addressing for protection only those primary suppressions
with insufficient protection from the first run, while considering the bounds
obtained from the audit as constraints on the cell values of the other (primary and
secondary) suppressions. Because computation times for method ‘Optimal’ seem to
depend strongly on the number of primary suppressions, which would not be many for
the post-processing, computation times for the post-processing might turn out to be
acceptable even for larger tables.

In a situation where multiple linked, or extremely large 3- or more dimensional
tables have to be protected, with the current version of τ-ARGUS the user is confined
to method ‘Hypercube’. For suggestions how to improve the performance of this
method for linked tables by specialized processing see [10].

First test results from the CTA method for tabular data are quite encouraging. If
they are confirmed in further testing, the method might eventually be released to
familiarize ARGUS users with CTA methodology. Further testing is required concerning
the choice of the cost function and the impact of the secondary cell suppression
algorithm employed, and of course the method should be compared to alternative
implementations of the original CTA method [5, 6, 2, 11]. Extensions for linked
tables should be designed, and it has to be explored how to communicate data quality
of synthetic data to the users of those tables.

Appendix



Table 3. Formulas to compute the protection level (a).

Sensitivity rule         Protection level

p%-rule                  $\frac{100 + p}{100}\, x_1 + x_2 - x$

(n,k)-dominance rule     $\frac{100}{k} \sum_{i \le n} x_i - x$

(a) $x$ denoting the cell value, $x_i$ the i-th individual contribution to the cell
(contributions sorted by size).

Table 4. Structure of test tables.

                              Number of Cells
Table   Hierarchical     Total      Zero   Primary
        levels                             Suppressions

2-dimensional tables
1       3                  460        71        18
2       4                 1050       168        60
3       6                 8230      2061       989

3-dimensional tables
4       3                 8280      2740       848
5       4                18900      6937      2198
6       6               148140     78008     19029









Table 5. CPU times (h:mm).

Table   Hier.    No Cells   Hyp      Mod      Opt      Si/Opt          Audit
        Levels

2-dimensional tables
1       3             460   <0:01    <0:01    0:02     0:00            <0:01
2       4            1050   <0:01    <0:01    0:03     0:01            <0:01
3       6            8230   <0:01    <0:01    0:10     1:02 (1:31)     <0:01

3-dimensional tables
4       3            8280   0:01     1:12     1:33     3:00            ca. 0:02
5       4           18900   0:01     3:11     12:10    15:00           ca. 0:45
6       6          148140   0:05     -        -        -               -



Table 6. Performance with alternative cost functions, and increased execution times
(table 3 data).

Hyp   Mod   Opt   Si/Opt   Combined
Cost Function
CPU (min)
Level   Val   Val   c1   c2   No   62   91

No Suppressions (%)
Overall   14.92   11.48   15.36   12.48   16.35   12.64   17.97   16.23
1          5.83    2.19    2.67    2.79    2.07    5.59    2.92    2.31
11        12.59    6.67    6.67   15.28   11.11   11.11   17.04   14.81

Added Value of Suppressions (%)
Overall    6.78    1.51    1.64    4.75    1.74    6.05    2.06    1.86
1          2.35    0.09    0.06    0.20    0.06    0.54    0.11    0.11
11         0.04    0.01    -       1.55    0.01    1.32    0.03    0.02




Survey on Methods for Tabular Data Protection in ARGUS



Table 7. Audit results: Percentage of primary suppressions with insufficient protection.

Table   # Dim   Hierarchical   Primary        Percentage of primary suppressions
                levels         Suppressions   with insufficient protection
                                              Mod        Hyp

1       2       3                    18       5.56       0.00
2       2       4                    60       5.00       0.00
3       2       6                   989       5.46       1.31
4       3       3                   848       2.59       0.24
5       3       4                  2198       2.14       5.64



Table 8. Deviation to original values for test-table 4 CTA data. 



Percentage All cells Primary Secondary 

deviation Suppressions Suppressions 

Prot.Lev.>2 Prot.Lev.<2 



0.0              3376       0      44     285
0.0 - 0.1         202       1       0     201
0.1 - 0.5         206       5       0     201
0.5 - 1.0         117       5       1     111
1.0 - 1.5          74       8       1      65
1.5 - 2.0          58      10       0      48
2.0 - 5.0         398     214      16     168
5.0 - 10.0        351     219      63      69
10.0 - 15.0        18       0      17       1
15.0 - 26.0        13       0      13       0
26.0 - 50.0         4       0       4       0
50.0 - 100.0        0       0       0       0

Abstract. Data swapping, a term introduced in 1978 by Dalenius and
Reiss for a new method of statistical disclosure protection in confidential 
data bases, has taken on new meanings and been linked to new statistical 
methodologies over the intervening twenty-five years. This paper revisits
the original (1982) published version of the Dalenius-Reiss data
swapping paper and then traces the developments of statistical disclosure
limitation methods that can be thought of as rooted in the original
concept. The emphasis here, as in the original contribution, is on both
disclosure protection and the release of statistically usable data bases.

Keywords: Bounds table cell entries; Constrained perturbation; Con- 
tingency tables; Marginal releases; Minimal sufficient statistics; Rank 
swapping. 



1 Introduction 

Data swapping was first proposed by Tore Dalenius and Steven Reiss (1978) 
as a method for preserving confidentiality in data sets that contain categori- 
cal variables. The basic idea behind the method is to transform a database by 
exchanging values of sensitive variables among individual records. Records are 
exchanged in such a way as to maintain lower-order frequency counts or marginals.
Such a transformation both protects confidentiality by introducing uncertainty 
about sensitive data values and maintains statistical inferences by preserving 
certain summary statistics of the data. In this paper, we examine the influence 
of data swapping on the growing field of statistical disclosure limitation. 

Concerns over maintaining confidentiality in public-use data sets have
increased since the introduction of data swapping, as has access to large,
computerized databases. When Dalenius and Reiss first proposed data swapping, it
was in many ways a unique approach to the problem of providing quality data to
users while protecting the identities of subjects. At the time most of the approaches
to disclosure protection had essentially no formal statistical content, e.g., see
the 1978 report of the Federal Committee on Statistical Methodology, FCSM
(1978), for which Dalenius served as a consultant.

* Currently Visiting Researcher at CREST, INSEE, Paris, France.

Although the original procedure was little-used in practice, the basic idea 
and the formulation of the problem have had an undeniable influence on subse- 
quent methods. Dalenius and Reiss were the first to cast disclosure limitation 
firmly as a statistical problem. Following Dalenius (1977), Dalenius and Reiss 
define disclosure limitation probabilistically. They argue that the release of data 
is justified if one can show that the probability of any individual’s data being 
compromised is appropriately small. They also express a concern regarding the 
usefulness of data altered by disclosure limitation methods by focusing on the 
type and amount of distortion introduced in the data. By construction, data 
swapping preserves lower order marginal totals and thus has no impact on in- 
ferences that derive from these statistics. 

The current literature on disclosure limitation is highly varied and combines
the efforts of computer scientists, official statisticians, social scientists, and
statisticians. The methodologies employed in practice are often ad hoc, and there are
only a limited number of efforts to develop systematic and defensible approaches
for disclosure limitation (e.g., see FCSM, 1994; and Doyle et al., 2001). Among
our objectives here are the identification of connections and common elements
among some of the prevailing methods and the provision of a critical discussion
of their comparative effectiveness¹. What we discovered in the process of
preparing this review was that many of those who describe data swapping as a
disclosure limitation method either misunderstand the Dalenius-Reiss arguments
or attempt to generalize them in directions inconsistent with their original
presentation.

The paper is organized as follows. First, we examine the original proposal 
by Dalenius and Reiss for data swapping as a method for disclosure limitation, 
focusing on the formulation of the problem as a statistical one. Second, we ex- 
amine the numerous variations and refinements of data swapping that have been 
suggested since its initial appearance. Third, we discuss a variety of model-based 
methods for statistical disclosure limitation and illustrate that these have basic 
connections to data swapping. 

2 Overview of Data Swapping 

Dalenius and Reiss originally presented data swapping as a method for disclosure
limitation for databases containing categorical variables, i.e., for contingency
tables. The method calls for swapping the values of sensitive variables among
records in such a way that the t-order frequency counts, i.e., entries in the
t-way marginal table, are preserved. Such a transformed database is said to be
t-order equivalent to the original database.

¹ The impetus for this review was a presentation delivered at a memorial session for
Tore Dalenius at the 2003 Joint Statistical Meetings in San Francisco, California. Tore
Dalenius made notable contributions to statistics in the areas of survey sampling and
confidentiality. In addition to the papers we discuss here, we especially recommend
Dalenius (1977, 1988) to the interested reader.

The justification for data swapping rests on the existence of sufficient num- 
bers of t-order equivalent databases to introduce uncertainty about the true 
values of sensitive variables. Dalenius and Reiss assert that any value of a sensi- 
tive variable is protected from compromise if there is at least one other database 
or table, t-order equivalent to the original one, that assigns it a different value. It 
follows that an entire database or contingency table is protected if the values of 
sensitive variables are protected for each individual. The following simple exam- 
ple demonstrates how data swaps can preserve second-order frequency counts. 

Example: Table 1 contains data for three variables for seven individuals. Sup- 
pose variable X is sensitive and we cannot release the original data. In particular, 
notice that record number 5 is unique and is certainly at risk for disclosure from 
release of the three-way tabulated data. However, is it safe to release the two-way 
marginal tables? 

Table 1b shows the table after a data-swapping transformation. Values of X
were swapped between records 1 and 5 and between records 4 and 7. When we 
display the data in tabular form as in Table 2, we see that the two-way marginal 
tables have not changed from the original data. Summing over any dimension 
results in the same 2-way totals for the swapped data as for the original data. 
Thus, there are at least two data bases that could have generated the same set 
of two-way tables. The data for any single individual cannot be determined with 
certainty from the release of this information alone. 
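
The example can be checked mechanically. The following Python sketch is not part of
the original paper; the helper names `swap_values` and `marginals` are illustrative
choices of ours. It encodes the seven records of Table 1, exchanges X between
records 1 and 5 and between records 4 and 7, and compares the marginal tables of the
two databases.

```python
from collections import Counter
from itertools import combinations

# Records of Table 1(a) as (X, Y, Z) tuples, in record order 1..7.
original = [(0, 1, 0), (0, 1, 0), (0, 0, 1), (0, 0, 1),
            (1, 1, 1), (1, 0, 0), (1, 0, 0)]


def swap_values(records, var, pairs):
    """Copy `records`, exchanging variable index `var` within each 1-based pair."""
    out = [list(r) for r in records]
    for i, j in pairs:
        out[i - 1][var], out[j - 1][var] = out[j - 1][var], out[i - 1][var]
    return [tuple(r) for r in out]


def marginals(records, order):
    """All `order`-way marginal tables, keyed by the tuple of variable indices."""
    n_vars = len(records[0])
    return {
        variables: Counter(tuple(r[v] for v in variables) for r in records)
        for variables in combinations(range(n_vars), order)
    }


# Swap X (index 0) between records 1 and 5 and between records 4 and 7.
swapped = swap_values(original, var=0, pairs=[(1, 5), (4, 7)])

same_two_way = marginals(original, 2) == marginals(swapped, 2)
same_three_way = marginals(original, 3) == marginals(swapped, 3)
print(same_two_way, same_three_way)
```

Under these assumptions the two-way tables coincide while the three-way table does
not, which is the sense in which the two databases are 2-order equivalent.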

Table 1. Swapping X values for two pairs of records in a 3-variable hypothetical
example

(a) Original Data            (b) Swapped Data

Record  X  Y  Z              Record  X  Y  Z
1       0  1  0              1       1  1  0
2       0  1  0              2       0  1  0
3       0  0  1              3       0  0  1
4       0  0  1              4       1  0  1
5       1  1  1              5       0  1  1
6       1  0  0              6       1  0  0
7       1  0  0              7       0  0  0



An important distinction arises concerning the form in which data are
released. Releasing the transformed data set as microdata clearly requires that
enough data are swapped to introduce sufficient uncertainty about the true
values of individuals’ data. In simple cases such as the example in Table 1 above,
appropriate data swaps, if they exist, can be identified by trial and error. However,
identifying such swaps in larger data sets is difficult. An alternative is to release
the data in tabulated form. All marginal tables up to order t are unchanged by
the transformation. Thus, tabulated data can be released by showing the
existence of appropriate swaps without actually identifying them. Schlorer (1981)
discusses some of the trade-offs between the two approaches and we return to this
issue later in the context of extensions to data swapping.

Table 2. Tabular versions of original and swapped data from Table 1

(a) Original Data                 (b) Swapped Data

         Z = 0      Z = 1                  Z = 0      Z = 1
         Y          Y                      Y          Y
X        0   1      0   1         X        0   1      0   1
0        0   2      2   0         0        1   1      1   1
1        2   0      0   1         1        1   1      1   0

Dalenius and Reiss developed a formal theoretical framework for data swap- 
ping upon which to evaluate its use as a method for protecting confidentiality. 
They focus primarily on the release of data in the form of 2-way marginal to- 
tals. They present theorems and proofs that seek to determine conditions on the 
number of individuals, variables, and the minimum cell counts under which data 
swapping can be used to justify the release of data in this form. They argue 
that release is justified by the existence of enough 2-order equivalent databases 
or tables to ensure that every value of every sensitive variable is protected with 
high probability. 

In the next section we discuss some of the main theoretical results presented
in the paper. Many of the details and proofs in the original text are unclear, and
we do not attempt to verify or replace them. Most important for our discussion
is the statistical formulation of the problem. It is the probabilistic concept of
disclosure and the maintenance of certain statistical summaries that have proved
influential in the field.

2.1 Theoretical Justification for Data Swapping 

Consider a database in the form of an $N \times V$ matrix, where N is the number of
individuals and V is the number of variables. Suppose that each of the V variables
is categorical with $r \ge 2$ categories. Further define parameters $a_i$, $i \ge 1$, that
describe lower bounds on the marginal counts. Specifically, $a_i = N/m_i$ where $m_i$
is the minimum count in the i-way marginal table.

Dalenius and Reiss consider the release of tabulated data in the form of 2-way 
marginal tables. In their first result, they consider swapping values of a single 
variable among a random selection of k individuals. They then claim that the 
probability that the swap will result in a 2-equivalent database is 



Observations: 

1. The proof of this result assumes that only 1 variable is sensitive. 

2. The proof also assumes that variables are independent. Their justification
is: “each pair of categories will have a large overlap with respect to k.”
But the specific form of independence is left vague. The 2-way margins for
X are in fact the minimal sufficient statistics for the model of conditional
independence of the other variables given X (for further details, see Bishop,
Fienberg, and Holland, 1975).

Dalenius and Reiss go on to present results that quantify the number of
potential swaps that involve k individuals. Conditions on V, N, and $a_2$ follow
that ensure the safety of data released as 2-order statistics. However, the role of
k in the discussion of safety for tabulated data is unclear. First they let k = V
to get a bound on the expected number of data swaps. The first main result is:

Theorem 1. If $V < N/a_2$, V > 4, and N >

for some function F then the expected number of possible data-swaps of k = V
individuals involving a fixed variable is > F.



Unfortunately, no detail or explanation is given about the function F.
Conditions on V, N, and $a_2$ that ensure the safety of data in 2-way marginal tables
are stated in the following theorem:



Theorem 2. If $V < N/a_2$, and



N 

{log(57VUp*)}2/('^-i) 






where p* = log(1 − p)/log(p), then, with probability p, every value in the database
is 2-safe.



Observations: 



1. The proof depends on the previous result that puts a lower bound on the
expected number of data swaps involving k = V individuals. Thus the result
is not about releasing all 2-way marginal tables but only those involving a
specific variable, e.g., X.

2. The lower bound is a function F, but no discussion of F is provided. 

In reading this part of the paper and examining the key results, we noted 
that Dalenius and Reiss do not actually swap data. They only ask about possible 
data swaps. Their sole purpose appears to have been to provide a framework for 
evaluating the likelihood of disclosure. 

In part, the reason for focusing on the release of tabulated data is that identi- 
fying suitable data swaps in large databases is difficult. Dalenius and Reiss do ad- 
dress the use of data swapping for release of microdata involving non-categorical 
data. Here, it is clear that a database must be transformed by swapping before 
it can safely be released; however, the problem of identifying enough swaps to 
protect every value in the data base turns out to be computationally impractical. 
A compromise, wherein data swapping is performed so that t-order frequency 
counts are approximately preserved, is suggested as a more feasible approach. 
Reiss (1984) gives this problem extensive treatment and we discuss it in more
detail in the next section.

We need to emphasize that we have been unable to verify the theoretical
results presented in the paper, although they appear to be more specialized than
the exposition suggests, e.g., being based on a subset of 2-way marginals and not
on all 2-way marginals. This should not be surprising to those familiar with the
theory of log-linear models for contingency tables, since the cell probabilities for
the no 2nd-order interaction model involving the 2-way margins do not have an
explicit functional representation (e.g., see Bishop, Fienberg, and Holland, 1975).
For similar reasons the extension of these results to orders greater than 2 is far
from straightforward, and may involve only marginals that specify decomposable
log-linear models (cf. Dobra and Fienberg, 2000).

Nevertheless, we find much in the authors’ formulation of the disclosure limi- 
tation problem that is important and interesting, and that has proved influential 
in later theoretical developments. We summarize these below. 

1. The concept of disclosure is probabilistic and not absolute: 

(a) Data release should be based on an assessment of the probability of the
occurrence of a disclosure, cf. Dalenius (1977).

(b) Implicit in this conception is the trade-off between protection and util- 
ity. Dalenius also discusses this in his 1988 Statistics Sweden monograph. 
He notes that essentially there can be no release of information without 
some possibility of disclosure. It is in fact the responsibility of data man- 
agers to weigh the risks. Subjects/respondents providing data must also 
understand this concept of confidentiality. 

(c) Recent approaches rely on this trade-off notion, e.g., see Duncan, et al. 
(2001) and the Risk-Utility frontiers in NISS web-data-swapping work 
(Gomatam, Karr, and Sanil, 2004). 

2. Data utility is defined statistically: 

(a) The requirement to maintain a set of marginal totals places the emphasis 
on statistical utility by preserving certain types of inferences. Although 
Dalenius and Reiss do not mention log-linear models, they are clearly 
focused on inferences that rely on t-way and lower order marginal totals. 
They appear to have been the first to make this a clear priority. 

(b) The preservation of certain summary statistics (at least approximately) 
is a common feature among disclosure limitation techniques, although 
until recently there was little reference to the role these statistics have 
for inferences with regard to classes of statistical models. 

We next discuss some of the immediate extensions by Dalenius and Reiss
to their original data swapping formulation and its principal initial application.
Then we turn to what others have done with their ideas.



2.2 Data Swapping for Microdata Releases 

Two papers followed the original data swapping proposal and extended those
methods. Reiss (1984) presented an approximate data swapping approach for
the release of microdata from categorical databases that approximately preserves
t-order marginal totals. He computed relevant frequency tables from the original
database, and then constructed a new database elementwise to be consistent
with these tables. To do this he randomly selected the value of each element
according to a probability distribution derived from the original frequency tables
and then updated the table each time he generated a new element.

Reiss, Post, and Dalenius (1982) extended the original data swapping idea
to the release of microdata files containing continuous variables. For continuous
data, they chose data swaps to maintain generalized moments of the data,
e.g., means, variances and covariances of the set of variables. As in the case of
categorical data, finding data swaps that provide adequate protection while
preserving the exact statistics of the original database is impractical. They present
an algorithm for approximately preserving generalized kth order moments for
the case of k = 2.

2.3 Applying Data Swapping to Census Data Releases 

The U.S. Census Bureau began using a variant of data swapping for data releases 
from the 1990 decennial census. Before implementation, the method was tested 
with extensive simulations, and the release of both tabulations and microdata 
was considered (for details, see Navarro, et al. (1988) and Griffin et al. (1989)). 
The results were considered to be a success and essentially the same methodology 
was used for actual data releases. 

Fienberg, et al. (1996) describe the specifics of this data swapping methodology
and compare it against Dalenius and Reiss’ proposal. In the Census Bureau’s
version, records are swapped between census blocks for individuals or households
that have been matched on a predetermined set of k variables. The (k + 1)-way
marginals involving the matching variables and census block totals are guaranteed
to remain the same; however, marginals for tables involving other variables
are subject to change at any level of tabulation. But, as Willenborg and de Waal
(2001) note, swapping affects the joint distribution of swapped variables, i.e.,
geography, and the variables not used for matching, possibly attenuating the
association. One might aim to choose the matching variables to approximate
conditional independence between the swapping variables and the others.

Because the swapping is done between blocks, this appears to be consistent 
with the goals of Dalenius and Reiss, at least as long as the released marginals 
are those tied to the swapping. Further, the method actually swaps a specified 
(but unstated) number of records between census blocks, and this becomes a 
data base from which marginals are released. However the release of margins 
that have been altered by swapping suggests that the approach goes beyond the 
justification in Dalenius and Reiss. 

Interestingly, the Census Bureau description of their data swapping methods
makes little or no reference to Dalenius and Reiss’s results, especially with regard
to protection. As for utility, the Bureau focuses on achieving the calculation
of summary statistics in released margins other than those left unchanged by
swapping (e.g., correlation coefficients) rather than on inferences with regard to
the full cross-classification.

Procedures for the U.S. 2000 decennial census were similar, although with
modifications (Zayatz 2002). In particular, unique records that were at more
risk of disclosure were targeted to be involved in swaps. While the details of the
approach remain unclear, the Office for National Statistics in the United Kingdom
has also applied data swapping as part of its disclosure control procedures for
the U.K. 2001 census releases (see ONS, 2001).

3 Variations on a Theme— Extensions and Alternatives 

3.1 Rank Swapping 

Moore (1996) described and extended the rank-based proximity swapping
algorithm suggested for ordinal data by Brian Greenberg in a 1987 unpublished
manuscript. The algorithm finds swaps for a continuous variable in such a way
that swapped records are guaranteed to be within a specified rank-distance of 
one another. It is reasonable to expect that multivariate statistics computed from 
data swapped with this algorithm will be less distorted than those computed af- 
ter an unconstrained swap. Moore attempts to provide rigorous justification for 
this, as well as conditions on the rank-proximity between swapped records that 
will ensure that certain summary statistics are preserved within a specified in- 
terval. The summary statistics considered are the means of subsets of a swapped 
variable and the correlation between two swapped variables. Moore makes a cru- 
cial assumption that values of a swapped variable are uniformly distributed on 
the interval between its bottom-coded and top-coded values, although few of 
those who have explored rank swapping have done so on data satisfying such an 
assumption. He also includes both simulations (e.g., for skewed variables) and 
some theoretical results on the bias introduced by two independent swaps on the 
correlation coefficient. 
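
The rank-distance constraint can be illustrated with a minimal Python sketch. It is
not Moore's or Greenberg's published algorithm: the partner-selection rule (a
uniformly chosen, not yet swapped record at most `window` ranks higher) and the names
`rank_swap` and `window` are assumptions made purely for illustration.

```python
import random


def rank_swap(values, window, seed=0):
    """Swap each value with a partner at most `window` ranks away (illustrative)."""
    rng = random.Random(seed)
    # order[r] is the index of the record holding the value of rank r.
    order = sorted(range(len(values)), key=lambda i: values[i])
    swapped = list(values)
    used = [False] * len(values)  # indexed by rank
    for r, i in enumerate(order):
        if used[r]:
            continue
        last = min(r + window, len(order) - 1)
        candidates = [s for s in range(r + 1, last + 1) if not used[s]]
        if not candidates:
            continue  # no free partner inside the window: value stays unchanged
        s = rng.choice(candidates)
        j = order[s]
        swapped[i], swapped[j] = values[j], values[i]
        used[r] = used[s] = True
    return swapped


incomes = [12, 55, 31, 47, 18, 90, 26, 63]
print(rank_swap(incomes, window=2))
```

Because values are only exchanged, the univariate distribution of the swapped variable
is unchanged; a smaller `window` keeps each record closer to its original value, at
the price of less uncertainty for an intruder.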

Domingo-Ferrer and Torra (2001a, 2001b) use a simplified version of rank
swapping in a series of simulations of microdata releases and claim that
it provides superior performance among methods for masking continuous data.
Trottini (2003) critiques their performance measures and suggests great caution
in interpreting their results.

Carlson and Salabasis (2002) also present a data-swapping technique based
on ranks that is appropriate for continuous or ordinally scaled variables. Let X
be such a variable and consider two databases containing independent samples
of X and a second variable, Y. Suppose that these databases, $S_1 = [X_1, Y_1]$ and
$S_2 = [X_2, Y_2]$, are ranked with respect to X. Then for large sample sizes, the
corresponding ordered values of $X_1$ and $X_2$ should be approximately equal. The
authors suggest swapping $X_1$ and $X_2$ to form the new databases, $S_1^* = [X_1, Y_2]$
and $S_2^* = [X_2, Y_1]$. The same method can be used given only a single sample by
randomly dividing the database into two equal parts, ranking and performing
the swap, and then recombining.
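
The single-sample variant described above can be sketched in Python as follows. This
is an illustration under stated assumptions rather than the authors' implementation:
the function name `split_rank_swap` is ours, and when the sample size is odd the last
record is simply left unswapped.

```python
import random


def split_rank_swap(x, y, seed=0):
    """Halve the sample at random, rank both halves on x, and exchange the
    x values of records holding the same rank in the two halves."""
    n = len(x) - len(x) % 2  # assumption: with odd n the last record is untouched
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    half1 = sorted(idx[: n // 2], key=lambda i: x[i])
    half2 = sorted(idx[n // 2:], key=lambda i: x[i])
    new_x = list(x)
    for i, j in zip(half1, half2):
        new_x[i], new_x[j] = x[j], x[i]
    # Recombine: every record keeps its own y and receives its partner's x.
    return list(zip(new_x, y))


x = [5.0, 1.0, 4.0, 2.0, 3.0, 6.0]
y = ["a", "b", "c", "d", "e", "f"]
print(split_rank_swap(x, y))
```

The set of X values, and hence every univariate moment of X, is unchanged; only the
pairing of X with Y is altered, which is why the effect on their correlation is the
quantity of interest.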

Clearly this method, in either variation, maintains univariate moments of
the data. Carlson and Salabasis’ primary concern, however, is the effect of the
data swap on the correlation between X and Y. They examine analytically
the case where X and Y are bivariate normal with correlation coefficient $\rho$,
using theory of order statistics and find bounds on $\rho$. The expected deterioration
in the association between the swapped variables increases with the absolute
magnitude of $\rho$ and decreases with sample size. They support these conclusions
by simulations.

While this paper provides the first clear statistical description of data
swapping in the general non-categorical situation, it has a number of shortcomings.
In particular, Fienberg (2002) notes that: (1) the method is extremely wasteful
of the data, using 1/2 or 1/3 according to the variation chosen, and thus is
highly inefficient; standard errors for swapped data are approximately 40% to
90% higher than for the original unswapped data; (2) the simulations and theory
apply only to bivariate correlation coefficients and the impact of the swapping
on regression coefficients or partial correlation coefficients is unclear.



3.2 NISS Web-Based Data Swapping 

Researchers at the National Institute of Statistical Sciences (NISS), working with
a number of U.S. federal agencies, have developed a web-based tool to perform
data swapping in databases of categorical variables. Given user-specified
parameters such as the swap variables and the swap rate, i.e., the proportion of records
to be involved in swaps, this software produces a data set for release as
microdata. For each swapping variable, pairs of records are randomly selected and
values for that variable are exchanged if the records differ on at least one of the
unswapped attributes. This is performed iteratively until the designated number
of records have been swapped. The system is described in Gomatam, Karr,
Chunhua, and Sanil (2003). Documentation and free downloadable versions of
the software are available from the NISS web-page, www.niss.org.
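
The procedure just described can be sketched in a few lines of Python. This is not the
NISS software or its interface: the function name `random_swap`, its parameters, the
rule that each record takes part in at most one swap, and the `max_tries` safeguard
are all assumptions made for illustration.

```python
import random


def random_swap(records, swap_var, swap_rate, seed=0, max_tries=100_000):
    """Exchange `swap_var` between randomly chosen pairs of records that differ
    on at least one unswapped attribute, until about swap_rate * n records
    have been swapped. Returns the new records and the number swapped."""
    rng = random.Random(seed)
    data = [dict(r) for r in records]  # leave the input untouched
    others = [k for k in data[0] if k != swap_var]
    target = int(swap_rate * len(data))
    free = list(range(len(data)))  # records not yet involved in a swap
    swapped = 0
    tries = 0
    while swapped + 2 <= target and len(free) >= 2 and tries < max_tries:
        tries += 1
        i, j = rng.sample(free, 2)
        if any(data[i][k] != data[j][k] for k in others):
            data[i][swap_var], data[j][swap_var] = data[j][swap_var], data[i][swap_var]
            free.remove(i)
            free.remove(j)
            swapped += 2
    return data, swapped


records = [{"age": i % 3, "sex": i % 2, "income": i % 5} for i in range(30)]
released, n_swapped = random_swap(records, swap_var="income", swap_rate=0.2)
print(n_swapped)
```

Note that this sketch preserves only the univariate distribution of the swap variable;
nothing constrains the two-way or higher-order margins.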

Rather than aiming to preserve any specific set of statistics, the NISS pro- 
cedure focuses on the trade-off between disclosure risk and data utility. Both 
risk and utility diminish as the number of swap variables and the swap rate 
increase. For example, a high swapping rate implies that data are well-protected 
from compromise, but also that their inferential properties are more likely to be 
distorted. Gomatam, Karr and Sanil (2004) formulate the problem of choosing 
optimal values for these parameters as a decision problem that can be viewed 
in terms of a risk-utility frontier. The risk-utility frontier identifies the greatest 
amount of protection achievable for any set of swap variables and swap rate. 

One can measure risk and utility in a variety of ways, e.g., the proportion
of unswapped records that fall into small-count cells (e.g., with counts less than
3) in the tabulated, post-swapped data base. Gomatam and Karr (2003, 2004)
examine and compare several “distance measures” of the distortion in the joint
distributions of categorical variables that occurs as a result of data swapping,
including Hellinger distance, total variation distance, Cramér’s V, the
contingency coefficient C, and entropy. Gomatam, Karr, and Sanil (2004) consider a
less general measure of utility: the distortion in inferences from a specific
statistical analysis, such as a log-linear model analysis.

Given methods for measuring risk and utility, one can identify optimal
releases empirically by first generating a set of candidate releases, performing
data swapping with a variety of swapping variables and rates, and then measuring
risk and utility on each of the candidate releases to provide a means of
making comparisons. Those pairs that dominate in terms of having low risk and
high utility comprise a risk-utility frontier that leads to optimal swaps for
allowable levels of risk. Gomatam, Karr, and Sanil (2003, 2004) provide a detailed
discussion of choosing swap variables and swap rates for microdata releases of
categorical variables.
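
Once risk and utility have been measured on each candidate release, extracting the
frontier is a matter of discarding dominated candidates. The Python sketch below
assumes that each candidate is summarized by one risk score (lower is better) and one
utility score (higher is better); the function name and the numbers are hypothetical
and are not taken from the NISS work.

```python
def risk_utility_frontier(candidates):
    """Return the candidates that no other candidate dominates, sorted by risk.

    A candidate is dominated if another one has risk no higher and utility no
    lower, with at least one of the two strictly better."""
    frontier = []
    for c in candidates:
        dominated = any(
            o["risk"] <= c["risk"]
            and o["utility"] >= c["utility"]
            and (o["risk"] < c["risk"] or o["utility"] > c["utility"])
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c["risk"])


# Hypothetical (risk, utility) measurements for four candidate releases.
candidates = [
    {"name": "A", "risk": 0.10, "utility": 0.90},
    {"name": "B", "risk": 0.05, "utility": 0.80},
    {"name": "C", "risk": 0.08, "utility": 0.70},
    {"name": "D", "risk": 0.02, "utility": 0.60},
]
print([c["name"] for c in risk_utility_frontier(candidates)])
```

Given a maximum allowable risk, the data manager would then pick the frontier
candidate with the highest utility among those at or below that risk.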

3.3 Data Swapping and Local Recoding 

Takemura (2002) suggests a disclosure limitation procedure for microdata that 
combines data swapping and local recoding (similar to micro-aggregation). First, 
he identifies groups of individuals in the database with similar records. Next, he 
proposes “obscuring” the values of sensitive variables either by swapping records 
among individuals within groups, or recoding the sensitive variables for the entire 
group. The method works for both continuous and categorical variables. 

Takemura suggests using matching algorithms to identify and pair similar
individuals for swapping, although other methods (clustering) could be used.
The bulk of the paper discusses optimal methods for matching records, and
in particular he focuses on the use of Edmonds’ algorithm, which represents
individuals as nodes in a graph, links the nodes with weighted edges, and then
matches individuals by a weight maximization algorithm.
The swapping version of the method bears considerable resemblance to rank
swapping, but the criterion for swapping varies across individuals.

3.4 Data Shuffling 

Muralidhar and Sarathy (2003a, 2003b) report on their variation of data swapping,
which they label data shuffling, in which they propose to replace sensitive
data by simulated data with similar distributional properties. In particular,
suppose that X represents sensitive variables and S non-sensitive variables. Then
they propose a two-step approach:

– Generate new data Y to replace X by using the conditional distribution of
X given S, f(X|S), so that f(X|S, Y) = f(X|S). Thus they claim that the
released versions of the sensitive data, i.e., Y, provide an intruder with no
additional information about f(X|S). One of the problems is, of course, that
f is unknown and thus there is information in Y.

– Replace the rank order values of Y with those of X, as in rank swapping.
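
The two steps can be sketched for a single sensitive variable. This is a minimal illustration, not the authors' procedure: it assumes f(X|S) is a simple linear regression with normal errors fitted by least squares, and the function name and data are hypothetical.

```python
import random


def shuffle_sensitive(x, s, seed=0):
    """Two-step shuffle of sensitive values x given non-sensitive values s."""
    rng = random.Random(seed)
    n = len(x)
    # Assumed model for f(X|S): x = intercept + slope * s + normal error.
    mean_s, mean_x = sum(s) / n, sum(x) / n
    sxx = sum((si - mean_s) ** 2 for si in s)
    slope = sum((si - mean_s) * (xi - mean_x) for si, xi in zip(s, x)) / sxx
    intercept = mean_x - slope * mean_s
    resid_var = sum((xi - intercept - slope * si) ** 2
                    for si, xi in zip(s, x)) / (n - 2)
    sd = resid_var ** 0.5
    # Step 1: simulate Y from the fitted conditional distribution.
    y = [intercept + slope * si + rng.gauss(0.0, sd) for si in s]
    # Step 2: the record holding the k-th smallest Y receives the k-th smallest X.
    order = sorted(range(n), key=lambda i: y[i])
    x_sorted = sorted(x)
    released = [None] * n
    for rank, i in enumerate(order):
        released[i] = x_sorted[rank]
    return released


# Hypothetical data: 20 records with x roughly equal to 2 * s.
s = [float(v) for v in range(20)]
noise = random.Random(42)
x = [2.0 * si + noise.gauss(0.0, 3.0) for si in s]
released = shuffle_sensitive(x, s)
```

The released column is a permutation of the original sensitive values, so its univariate distribution is unchanged; only the assignment of values to records differs.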

They provide some simulation results that they argue show the superiority of 
their method over rank swapping in terms of data protection with little or no 
loss in the ability to do proper inferences in some simple bivariate and trivariate 
settings.

4 Data Swapping and Model-Based Statistical Methods

We can define model-based methods in two ways: (a) methods that use a specific 
model to perturb or transform data to protect confidentiality; or (b) methods 
that involve some perturbation or transformation to protect confidentiality, but 
preserve minimal sufficient statistics for a specific model, thereby maintaining 
the data users’ inferences under that model. The former is exemplified by post- 
randomization methodologies and the latter by work on the release of margins 
from contingency tables or perturbed tables from conditional distributions. We 
describe these briefly in turn. 

4.1 Post Randomization Method PRAM 

The Post Randomization Method (PRAM) is a perturbation method for categorical
databases (Gouweleeuw et al., 1998). Suppose that a sensitive variable
has categories 1, . . . , m. In PRAM, each value of the variable in the database is 
altered according to a predefined transition probability (Markov) matrix. That 
is, conditional on its observed value, each value of the variable is assigned one of 
1, . . . , m. Thus, observations either remain the same or are changed to another 
possible value, all with known probability. This is essentially Warner’s (1965) 
method of randomized response but applied after the data are collected rather 
than before. Willenborg and de Waal (2001) note some earlier proposals of a 
similar nature and describe PRAM in a way that subsumes data swapping. 

The degree of protection provided by PRAM depends on the probabilities 
in the transition matrix, as well as the frequencies of observations in the origi- 
nal database. PRAM has little effect on frequency tables. Given the transition 
matrix, it is straightforward to estimate the univariate frequencies of the original
data, as well as the additional variance introduced by the method. The precise 
effect on more complicated analyses, such as regression models, can be difficult 
to assess. See the related work in the computer science literature by Agrawal 
and Srikant (2000) and Evfimievski, Gehrke, and Srikant (2003). 
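
A minimal sketch of the mechanism, with categories coded 0, ..., m-1 rather than 1, ..., m. The transition matrix and counts are hypothetical, and the estimator shown is the plain matrix-inversion estimate of the original frequencies, assuming the transition matrix is invertible.

```python
import random


def pram(values, transition, rng):
    """Perturb each category v by drawing a new one from row transition[v]."""
    categories = range(len(transition))
    return [rng.choices(categories, weights=transition[v])[0] for v in values]


def expected_counts(true_counts, transition):
    """Expected perturbed counts: c[l] = sum_k t[k] * P[k][l]."""
    m = len(transition)
    return [sum(true_counts[k] * transition[k][l] for k in range(m))
            for l in range(m)]


def estimate_original_counts(perturbed_counts, transition):
    """Solve sum_k t[k] * P[k][l] = c[l] for t by Gauss-Jordan elimination."""
    m = len(transition)
    # Row l of the augmented system holds column l of P followed by c[l].
    a = [[float(transition[k][l]) for k in range(m)] + [float(perturbed_counts[l])]
         for l in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(m):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [row[m] for row in a]


# Hypothetical two-category transition matrix and true counts.
P = [[0.9, 0.1], [0.2, 0.8]]
true_counts = [70, 30]
perturbed = pram([0] * 70 + [1] * 30, P, random.Random(1))
recovered = estimate_original_counts(expected_counts(true_counts, P), P)
```

Applied to the expected perturbed counts, the inversion recovers the original frequencies; applied to one realized perturbed file, it gives an unbiased but noisier estimate.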

4.2 Model-Based Approaches for the Release of Marginals and Other Statistics

Fienberg, Steele, and Makov (1996, 1998) suggest “bootstrap-like” sampling from
the empirical distribution of the data, and then releasing the sampled data for
analysis. Multiple replicates are required to assess the added variability of
estimates when compared with those that could be generated from the original
data. In the case of categorical data, this procedure is closely related to the
problem of generating entries in a contingency table given a fixed set of marginals.
Preserving marginal totals is equivalent to preserving sufficient statistics of
certain log-linear models. Diaconis and Sturmfels (1998) developed an algorithm for
generating such tables using Gröbner bases. Dobra (2003) shows that such bases
correspond to simple data swaps of the sort used by Dalenius and Reiss when the
corresponding log-linear model is decomposable, e.g., conditional independence
of a set of k − 1 variables given the remaining one in a k-way contingency table.
See Karr, Dobra, and Sanil (2003) for a web-based implementation.

The Dalenius and Reiss data swap preserves marginal totals of tables up 
to order t, and so can be viewed as a model-based method with respect to a 
log-linear model. In general, the set of tables that could be generated by data 
swapping is a subset of those that could be generated by the Diaconis and Sturm- 
fels algorithm because of non-simple basis elements required to generate the full 
conditional distribution, e.g., see Diaconis and Sturmfels (1998) and Fienberg, 
Makov, Meyer, and Steele (2001). By comparison, the resampling method of 
Domingo-Ferrer and Mateo-Sanz (1999) is not model-based and can only be 
used to preserve a single margin. It has the further drawback of fixing sampling 
zeros, thereby limiting its usefulness in large sparse contingency tables. 
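
The effect of a single swap on marginal tables can be checked directly. The sketch below uses hypothetical records and variable names; it shows that exchanging the value of one variable between two records always preserves the one-way margins but can change a two-way margin, which is why a Dalenius and Reiss style swap must choose its records so that all margins up to order t are preserved.

```python
from collections import Counter


def swap_variable(records, i, j, var):
    """Return a copy of records with the value of `var` exchanged between i and j."""
    out = [dict(r) for r in records]
    out[i][var], out[j][var] = out[j][var], out[i][var]
    return out


def margin(records, variables):
    """Marginal table (as a Counter) over the given variables."""
    return Counter(tuple(r[v] for v in variables) for r in records)


# Hypothetical microdata, for illustration only.
records = [
    {"sex": "M", "region": "N", "income": "low"},
    {"sex": "F", "region": "S", "income": "high"},
    {"sex": "M", "region": "S", "income": "high"},
    {"sex": "F", "region": "N", "income": "low"},
]
swapped = swap_variable(records, 0, 1, "income")
```

Here the income margin and the sex-by-region margin are unchanged, while the sex-by-income margin is not.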

Burridge (2003) extended the approach in Fienberg, Makov and Steele to
databases with continuous variables. Denote the database by (X, S), where X
represents sensitive variables that cannot be disclosed and S denotes the
remaining variables. Let T be a minimal sufficient statistic for the distribution of X
given S. Values of the sensitive variables are replaced with a random sample, Y,
from the distribution of X given (T, S) and the database (Y, S) is released. The
idea is that the minimal sufficient statistic T will be preserved in the released
database.

Clearly this method makes strong assumptions about the distributional
properties of the data. In the case of discrete variables where the distribution comes
from the exponential family, the results of Diaconis and Sturmfels apply again.
Burridge proposes the method specifically for the case where $X \mid S$ is multivariate
normal with mean $x\beta$ and covariance matrix $\Sigma$. He estimates sufficient statistics
by fitting a separate linear regression model to each column of X and constructing
the matrices $\beta$ and $\Sigma$, and he describes methods for generating perturbed
data Y that preserve the conditional mean and variance of $X \mid S$. He also
discusses the level of protection realized by this procedure. How general the
approach is, and how it relates to the other kinds of data swapping objectives in
the non-categorical case, remains to be seen. Note the similarity here to ideas
in Muralidhar and Sarathy (2003a, 2003b) but with a more formal statistical
justification.

5 Discussion 

In this paper we have revisited the original work of Dalenius and Reiss on 
data swapping and surveyed some of the literature and applications it has
spawned. In particular, we have noted the importance of linking the idea of data 
swapping to the release of marginals in a contingency table that are useful for 
statistical analysis. This leads rather naturally to a consideration of log-linear 
models for which marginal totals are minimal sufficient statistics. Although Dale- 
nius and Reiss made no references to log-linear models, they appear in retrospect 
to provide the justification for much of the original paper. A key role in the rele- 
vant theory is played by the conditional distribution of a log-linear model given 
its marginal minimal sufficient statistics.

There is an intimate relationship between the calculation of bounds for cell
entries in contingency tables given a set of released marginals (Dobra and
Fienberg, 2000, 2001) and the generation of tables from the exact distribution of a
log-linear model given its marginal minimal sufficient statistics. Work by Aoki
and Takemura (2003) and unpublished results of De Loera and Onn effectively
demonstrate the possibility that the existence of non-simple basis elements can
yield multi-modal exact distributions or bounds for cells where there are gaps
in realizable values. These results suggest that data swapping as originally
proposed by Dalenius and Reiss does not generalize in ways that they thought. But
the new mathematical and statistical tools should allow us to reconsider their
work and evolve a statistically-based methodology consistent with their goals.

Abstract. In recent work on statistical methods for confidentiality and
disclosure limitation, Dobra and Fienberg (2000, 2003) and Dobra (2002)
have generalized Bonferroni-Fréchet-Hoeffding bounds for cell entries in
k-way contingency tables given marginal totals. In this paper, we
consider extensions of their approach focused on upper and lower bounds
for cell entries given arbitrary sets of marginals and conditionals. We
give a complete characterization of the two-way table problem and
discuss some implications for statistical disclosure limitation. In particular,
we employ tools from computational algebra to describe the locus of all
possible tables under the given constraints and discuss how this
additional knowledge affects the disclosure.

Keywords: Confidentiality; Contingency tables; Integer programming;
Linear programming; Markov bases; Statistical disclosure control;
Tabular data.



1 Introduction 

Current disclosure limitation methods either change the data (e.g. cell
suppression, controlled rounding, and data swapping) or release partial information (e.g.
releasing marginals). Methods falling under the first category either have limited
statistical content (e.g. cell suppression) or modify existing statistical
correlations (e.g. controlled rounding), thus compromising proper statistical inference.
Methods in the second category ensure confidentiality by restricting the level of
detail at which data are released, as opposed to changing the data as with other
methods. Releases of partial information, such as releasing only the margins, often
allow for proper statistical inference. In this setting, Dobra and Fienberg ([8,6,9]) used
undirected graphical representations for log-linear models as a framework to
compute bounds on cell entries given a set of margins as input to assessing
disclosure risk. Natural extensions of this work involve other types of partial data
releases such as conditionals, regressions, or any other data summaries.
Government agencies in the U.S. are already releasing tables of rates, that is conditional
relative observed frequencies (for example, see Figure 1). However, not much is
known on their effect on confidentiality and data privacy. Furthermore, releasing
of conditional distributions for higher-dimensional contingency tables could be
useful for researchers interested in assessing causal inference with observed data
while still maintaining confidentiality.

* Currently Visiting Researcher at CREST, INSEE, Paris, France.

J. Domingo-Ferrer and V. Torra (Eds.): PSD 2004, LNCS 3050, pp. 30-43, 2004.
© Springer-Verlag Berlin Heidelberg 2004

Fig. 1. An example of a published 3-way table with rates from the U.S. Department
of Labor Bureau of Labor Statistics website. Data are from 2002 CPS supplement.
Source: Boraas, S. “Volunteerism in the U.S.” Monthly Labor Review, August 2003.



In this paper we develop nonparametric upper and lower bounds for cell
entries in two-way contingency tables given an arbitrary set of marginals and
conditionals for the purpose of evaluating disclosure risk. In Section 2, we provide
technical background and introduce some notation. In Section 3, we discuss the
complete characterization of the joint distribution and present new results on
uniqueness of the entries in the table given a marginal and a conditional. In
Section 4, we estimate the bounds on cell entries via optimization techniques
such as linear and integer programming and discuss issues relevant to disclosure
limitation. For two-way tables we give a complete characterization of the space
of possible tables given different margins and/or conditionals. We also employ a
representation from algebraic geometry to describe the locus of all possible tables
under the given constraints and discuss possible implications for disclosure.



2 Technical Background and Notation 

Beyond the disclosure setting, bounds for cell entries in tables arise in a va- 
riety of other statistical contexts. They can be found in mass transportation 
problems (Rachev and Rüschendorf [21]), computer tomography (Gutmann et
al. [16]), ecological inference (King [17]), and causal inference of imperfect ex- 
periments (Balke and Pearl [3]). Most of this work focuses on bounds induced 
by given marginals, with very little attention to bounds induced by sets of con- 
ditionals and marginals. For example, Balke and Pearl ([3]), Tian and Pearl 
([23]), and Pearl ([19]) use linear programming formulations to establish sharp 
bounds on causal effect in experiments with imperfect compliance. These are 
non-parametric bounds limited to a single cause analysis and the cases are lim- 
ited to bivariate random variables. Pearl ([19]) also describes a Gibbs sampling 
technique to calculate the posterior distribution of the causal quantity of inter- 
est, but he points out that the choice of prior distributions in this setting can 
have a significant influence on the posterior. 

While statisticians have long been interested in combining marginal and
conditional distributions, results have been developed mainly for complete
specification of the joint as in the work of Gelman and Speed ([14,15]), Arnold et
al. ([1,2]), and Besag ([4]). The Hammersley-Clifford Theorem ([4]) establishes a
connection between the joint distribution and the full conditionals, and often
describes conditional statements of an undirected graph. Gelman and Speed ([14])
defined conditions under which a collection of marginal and conditional
distributions uniquely identifies the joint distribution. Their key assumption is
positivity in the Hammersley-Clifford Theorem (Besag [4]) which, for the discrete
case, means that there are no cells with zero probability. Arnold et al. ([1,2])
extended the sets defined by the Gelman and Speed theorem by relaxing the
positivity condition such that in the discrete case the sets never involve
conditioning on an event of zero probability, i.e., appropriate marginal distributions
are strictly positive. Arnold et al. ([1,2]) also address the question of whether
the given set of densities is consistent (compatible) and whether it uniquely
specifies the joint density. When sets of queries in the form of margins and
conditionals satisfy the conditions of Gelman and Speed ([14]) and/or Arnold et
al. ([2]), the released information reveals all of the information in the table,
since the joint distribution is uniquely identified. This uniqueness poses a
problem in tables with small counts as it increases the risk of uniquely identifying
individuals who provided the data. We are interested in identifying such
situations as part of a broader goal of developing safe tabular releases in terms of
arbitrary sets of marginals and conditionals that would still allow for proper
statistical inferences about models for the original table.

Arnold et al. ([2]) describe an algorithm for checking the compatibility of a set 
of conditional probability densities. In the disclosure-type setting, we know that
the unique joint distribution exists, and released conditionals are compatible in
the sense that they come from the same joint distribution over the space of all 
variables. The issue of compatibility may become relevant if we are to consider 
that an intruder has outside additional knowledge and would like to incorporate 
some prior information with the data provided by the agency. We are interested 
in assessing the uniqueness condition. Arnold et al. ([2]) provide a uniqueness 
theorem (see Section 3) and describe an iterative algorithm due to Vardi and Lee 
(1993) for determining the missing marginal distribution in order to establish the 
joint distribution. In cases where there is more than one solution, the algorithm 
obtains one of the solutions but is unable to detect that there is more than 
one solution. This algorithm is like the iterative proportional fitting algorithm 
for maximum likelihood estimation from incomplete contingency tables (e.g., 
see Bishop, Fienberg, and Holland (1975)), and solves linear equations subject 
to non-negativity constraints and cell probabilities summing to one. While the 
algorithm always converges for compatible distributions, the convergence is quite 
slow even for two-way tables and it is unclear how successfully the algorithm 
deals with boundary cases. 

Algebraic statistics is an emerging field that advocates the use of tools of
computational commutative algebra such as Gröbner bases in statistics (Pistone et
al. [20]). Diaconis and Sturmfels’ (1998) seminal work had a major impact on
developments in statistical disclosure techniques. Their paper applies the
algebraic theory of toric ideals to define a Markov Chain Monte Carlo method for
sampling from conditional distributions, through the notion of Markov bases.
Given a set of marginal constraints, Dobra and Fienberg ([6], [9]) and Dobra et
al. ([10]) have utilized Gröbner bases in connection with graphs to fully describe
the space of possible tables. Most recently, Slavković ([22]) described Markov
bases for two-way tables given the sets of conditionals and marginals, a topic
closely related to that of the present paper, and to which we return later.



2.1 Notation 

Let X and Y be discrete random variables with possible values $x_1, x_2, \ldots, x_I$
and $y_1, y_2, \ldots, y_J$, respectively. Let $n_{ij}$ denote the observed cell counts in
the $I \times J$ table n. We represent their joint probability distribution as the
$I \times J$ matrix $P = (p_{ij})$, which is the normalized table of counts such that all
cell entries $p_{ij} = P(X = x_i, Y = y_j)$, $i = 1, 2, \ldots, I$, $j = 1, 2, \ldots, J$, are
nonnegative and sum to 1. Let $p_{i\cdot} = \sum_{j=1}^{J} p_{ij} = P(X = x_i)$ and
$p_{\cdot j} = \sum_{i=1}^{I} p_{ij} = P(Y = y_j)$ be the marginal probability distributions
for X and Y, respectively. We can also represent the conditional probability
distributions as $I \times J$ matrices $A = (a_{ij})$ and $B = (b_{ij})$ where

$$a_{ij} = P(X = x_i \mid Y = y_j) = \frac{p_{ij}}{p_{\cdot j}} = \frac{n_{ij}}{n_{\cdot j}}, \quad i = 1, 2, \ldots, I,\; j = 1, 2, \ldots, J, \tag{1}$$

$$b_{ij} = P(Y = y_j \mid X = x_i) = \frac{p_{ij}}{p_{i\cdot}} = \frac{n_{ij}}{n_{i\cdot}}, \quad i = 1, 2, \ldots, I,\; j = 1, 2, \ldots, J. \tag{2}$$

3 Complete Specification of the Joint Distribution



In the disclosure limitation setting, we say that sets of conditionals and/or 
marginals are compatible if they come from the same joint distribution. But 
when there is compatibility, we want to check if the given set of conditionals 
and/or marginals is sufficient to uniquely identify the joint distribution. We al- 
low cell entries to be zero as long as we do not condition on an event of zero prob- 
ability. Then the uniqueness theorem of Gelman and Speed ([14]) as amended 
by Arnold et al. ([2]) tells us that the joint distribution for any two-way table is 
uniquely identified by any of the following sets of distributions: 

(1) f(x|y) and f(y|x),

(2) f(x|y) and f(y),

(3) f(y|x) and f(x).

(Note that these are equivalent under the independence of X and Y.)

In addition, Arnold et al. ([1,2]) show that sometimes collections of the type
{f(x|y), f(x)} or {f(y|x), f(y)} also uniquely identify the joint distribution as
long as we do not condition on a set of probability zero. When the positivity 
condition on the Cartesian product assumed by Gelman and Speed does not hold, 
Arnold et al. ([2]) suggest that uniqueness can be checked by determining the 
uniqueness of a missing marginal given the provided information. Besides using 
the Vardi and Lee (1993) algorithm, they also suggest running a Markov process 
that generates A’s by cycling through a list of conditional distributions. If the 
process is irreducible, then there exists a unique marginal that together with the 
given conditionals uniquely determines the joint distribution. If we already know 
that we are given proper conditional distributions for tabular data, however, the 
proposed algorithm can be replaced by simply checking the number of levels of 
two variables and the rank of the conditional matrix. For a subset of cases we can 
apply a simple formula for finding the cell probabilities as well, as the following 
results imply. 

Theorem 1. For two discrete random variables, X and Y, either the collection
$C_x = \{f(x|y), f(x)\}$ or the collection $C_y = \{f(y|x), f(y)\}$ uniquely identifies the
joint distribution if matrices A and B have full rank and if $I \ge J$ for $C_x$ or
$J \ge I$ for $C_y$.

The unique joint distribution in Theorem 1 takes a simple form for the entries
of the $I \times 2$ table.

Theorem 2. Suppose that for an $I \times 2$ table, where $I \ge 2$, we are given f(x|y)
and f(x). Then the unique probabilities for the cell entries are given by

$$p_{ij} = a_{ij}\,\frac{p_{i\cdot} - a_{is}}{a_{ij} - a_{is}}, \qquad s \ne j. \tag{3}$$



Example 1. The data in Table 1 are from a fictitious survey of 50 students about
illegally downloading MP3s. Suppose that the only information available about
the survey is $P(\text{Download}) = \{0.4, 0.6\}$ and

$$P(\text{Download} \mid \text{Gender}) = (b_{ij}) = B = \begin{pmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{pmatrix}.$$

Applying Theorem 2, we get the exact cell probabilities in the original table
$\{p_{11}, p_{12}, p_{21}, p_{22}\} = \{0.3, 0.2, 0.1, 0.4\}$. For example, using the version of the
formula for f(y|x) and f(y),
$p_{12} = b_{12}\,\frac{p_{\cdot 2} - b_{22}}{b_{12} - b_{22}} = 0.4 \cdot \frac{0.6 - 0.8}{0.4 - 0.8} = 0.2$.



Table 1. Fictitious example: Number of students illegally downloading MP3s by gender

          Download Yes   Download No   Total
Male           15             10         25
Female          5             20         25
Total          20             30         50



Note that these results hold regardless of the value of sample size N. Knowledge
of the sample size will give us the original (unique) table of counts.
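
The arithmetic of Example 1 can be reproduced with exact fractions. The sketch below is restricted to the 2 × 2 case with f(y|x) and f(y) given; the function name is hypothetical, and the formula follows from $p_{\cdot j} = b_{ij}\,p_{i\cdot} + b_{sj}(1 - p_{i\cdot})$ for the other row s, which requires $b_{ij} \ne b_{sj}$.

```python
from fractions import Fraction as F


def joint_from_conditional_and_margin(b, p_col):
    """Joint p[i][j] from b[i][j] = P(Y=j | X=i) and p_col[j] = P(Y=j), 2 x 2 case."""
    p = [[None, None], [None, None]]
    for i in range(2):
        other = 1 - i  # index of the other row
        for j in range(2):
            # Row margin P(X=i) recovered from column j, then p_ij = b_ij * P(X=i).
            row_margin = (p_col[j] - b[other][j]) / (b[i][j] - b[other][j])
            p[i][j] = b[i][j] * row_margin
    return p


B = [[F(6, 10), F(4, 10)], [F(2, 10), F(8, 10)]]   # P(Download | Gender)
p_download = [F(4, 10), F(6, 10)]                  # P(Download)
P = joint_from_conditional_and_margin(B, p_download)
counts = [[int(50 * v) for v in row] for row in P]  # N = 50 as in Table 1
```

Multiplying the recovered probabilities by N = 50 returns the counts of Table 1, which is the full-disclosure situation the example warns about.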

4 Bounds for Cell Entries in I × J Tables

Any contingency table with non-negative integer or real entries and fixed
marginals and/or conditionals is a point in the convex polytope defined by a linear
system of equations induced by the released conditionals and marginals. When the
table does not satisfy the Gelman and Speed (1993) Theorem or Theorem 1, we
have the possibility of more than one realization for the joint distribution of (X, Y),
i.e., there is more than one possible table that satisfies the constraints implied by
the margins and conditionals. One way to evaluate the safety of released tabular
data is to determine bounds on cell entries given the margins and conditionals.
To fully describe the space of I × J tables subject to marginal and conditional
constraints, we need to consider combinations of marginals and conditionals of
the following forms:

1. f(x) or f(y),

2. f(x|y) or f(y|x),

3. {f(x|y), f(x), I < J} or {f(y|x), f(y), J < I}.

4.1 Linear/Integer Programming Bounds

The simplest and most natural method for solving a system of linear equations is
the simplex method. Dobra and Fienberg ([9]) discuss some inadequacies of this
method for addressing bounds given sets of marginals, but this method works
relatively well for tables with small dimensions. Hence we explore its feasibility
and usefulness for the other two listed cases.

Unknown Sample Size N. We first consider a setting where the sample size
N is unknown. It is trivial to see that the first two sets do not give us the joint
distribution. When a single marginal is given, the cell probabilities are bounded
below by zero and above by the corresponding marginal value, e.g., $0 \le p_{ij} \le p_{i\cdot}$
(Dobra and Fienberg [8]). When a single conditional distribution is given, the
cell probabilities are bounded by zero and the associated conditional probability;
for example, $0 \le p_{ij} \le a_{ij}$ for f(x|y). This can be verified by setting up a
linear programming (LP) problem since the observed conditional frequencies are
a linear-fractional map of either the original cell counts or cell probabilities:

$$\text{Max } p_{ij}, \tag{4}$$

subject to

$$\sum_i \sum_j p_{ij} = 1, \tag{5}$$

$$(1 - a_{ij})\,p_{ij} - a_{ij} \sum_{k \ne i} p_{kj} = 0, \qquad \sum_i a_{ij} = 1, \tag{6}$$

$$\text{and } p_{ij} \ge 0, \quad \forall\, i, j. \tag{7}$$



For 2 × 2 tables, for example, there are four equations of the form (6) but
linear dependencies reduce the set of these equations to two, each involving two
conditional values that add to one and their respective $p_{ij}$ entries. In addition,
the cross-product ratios of all 2 × 2 positive submatrices of A, B and n are equal,
i.e.,

$$\frac{a_{11}a_{22}}{a_{12}a_{21}} = \frac{b_{11}b_{22}}{b_{12}b_{21}} = \frac{n_{11}n_{22}}{n_{12}n_{21}}$$

(see Bishop, Fienberg and Holland (1975)). We can replace some of the
constraints in Equation (6) by a linearized version of the odds ratios to obtain the
equivalent bounds.
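
The equality of the cross-product ratios is easy to confirm numerically. The sketch below uses the counts of Table 1 and builds A and B from definitions (1) and (2); the helper names are illustrative.

```python
from fractions import Fraction as F


def conditionals(n):
    """Return (A, B): column-conditional and row-conditional matrices of counts n."""
    rows, cols = len(n), len(n[0])
    col_tot = [sum(n[i][j] for i in range(rows)) for j in range(cols)]
    row_tot = [sum(n[i]) for i in range(rows)]
    a = [[F(n[i][j], col_tot[j]) for j in range(cols)] for i in range(rows)]
    b = [[F(n[i][j], row_tot[i]) for j in range(cols)] for i in range(rows)]
    return a, b


def odds_ratio(m):
    """Cross-product ratio m11 * m22 / (m12 * m21) of a 2 x 2 matrix."""
    return F(m[0][0]) * m[1][1] / (m[0][1] * m[1][0])


n = [[15, 10], [5, 20]]  # counts from Table 1
A, B = conditionals(n)
ratios = (odds_ratio(A), odds_ratio(B), odds_ratio(n))
```

All three ratios equal 6 for Table 1, so releasing either conditional matrix reveals the odds ratio of the underlying counts.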

To complete the characterization for two-way tables, we need to consider
the third set of combinations of marginals and conditionals. For example, for
{f(x|y), f(x), I < J}, the following linear programming problem produces
optimal solutions:

$$\text{Max } p_{ij}, \tag{8}$$

subject to

$$\sum_i \sum_j p_{ij} = 1, \tag{9}$$

$$\sum_j p_{ij} = p_{i\cdot}, \tag{10}$$

$$a_{ij} = \frac{p_{ij}}{p_{ij} + \sum_{k \ne i} p_{kj}}, \quad \forall\, i = \{1, 2, \ldots, I\},\; j = \{1, 2, \ldots, J\}, \tag{11}$$

$$\text{and } p_{ij} \ge 0, \quad \forall\, i = \{1, 2, \ldots, I\},\; j = \{1, 2, \ldots, J\}. \tag{12}$$



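Theorem 3. For a 2 × J table given f(x|y) and f(x), sharp upper bounds on
the cell probabilities are given by

$$UB = \begin{cases}
a_{ij}\,\dfrac{p_{i\cdot} - \max_{k \ne j}\{a_{ik}\}}{a_{ij} - \max_{k \ne j}\{a_{ik}\}} & \text{if } p_{i\cdot} \ge a_{ij}, \\[2ex]
a_{ij}\,\dfrac{p_{i\cdot} - \min_{k \ne j}\{a_{ik}\}}{a_{ij} - \min_{k \ne j}\{a_{ik}\}} & \text{if } p_{i\cdot} < a_{ij},
\end{cases} \tag{13}$$

and sharp lower bounds are given by

$$LB = \begin{cases}
\max\left\{0,\; a_{ij}\,\dfrac{p_{i\cdot} - \min_{k \ne j}\{a_{ik}\}}{a_{ij} - \min_{k \ne j}\{a_{ik}\}} \;\text{ s.t. }\; a_{ij}\,\dfrac{p_{i\cdot} - \min_{k \ne j}\{a_{ik}\}}{a_{ij} - \min_{k \ne j}\{a_{ik}\}} \le UB \right\} & \text{if } p_{i\cdot} \ge a_{ij}, \\[3ex]
\max\left\{0,\; a_{ij}\,\dfrac{p_{i\cdot} - \max_{k \ne j}\{a_{ik}\}}{a_{ij} - \max_{k \ne j}\{a_{ik}\}} \;\text{ s.t. }\; a_{ij}\,\dfrac{p_{i\cdot} - \max_{k \ne j}\{a_{ik}\}}{a_{ij} - \max_{k \ne j}\{a_{ik}\}} \le UB \right\} & \text{if } p_{i\cdot} < a_{ij}.
\end{cases} \tag{14}$$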



These results also generalize to I × J tables, in that the first part of the
bounds formula (e.g.,
$a_{ij}\,\frac{p_{i\cdot} - \max_{k \ne j}\{a_{ik}\}}{a_{ij} - \max_{k \ne j}\{a_{ik}\}}$ if $p_{i\cdot} \ge a_{ij}$)
holds, but there is an additional factor by which we need to adjust the bounds.
For example, in a 3 × 4 table, let k be the index for the conditional value which
satisfies $\max\{a_{ik}\}$, l be the index for the conditional value that satisfies
$\min\{a_{ik}\}$, and $r \in \{I \setminus i\}$,

and let



, _ ^ij 9ij 
fn-9n' 



where 



Pr. ^rk 

Pi. 



fij — 



CLj^j (^rk 

5 

(^ij ^ik 



Then the solution for the upper bound is: 



9ij — 



arl 

an Pik 



UB = { 



Pi,—max{aik} 

a^ i r r X di i 

an — maxi aik t J 



Pi.— min{aife} 
r, ■ ■ Vi/.. 

aij-mni{aik} 



if 



if 



Pi. aij , 



Pi. < ai 



(15) 



More research remains to be done to understand the structures of these 
bounds and how they extend to k-way tables. 
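
For the 2 × J case, the LP bounds can be checked by enumerating vertices, since the feasible column margins q_j = p_.j satisfy only two equality constraints and every vertex therefore has at most two nonzero entries. The sketch below is an illustration under that observation; the row of A and the margin are hypothetical, and `upper_bound_closed_form` encodes one reading of the upper-bound formula of Theorem 3 (maximum over the other columns when $p_{i\cdot} \ge a_{ij}$, minimum otherwise).

```python
from fractions import Fraction as F
from itertools import combinations


def column_margin_vertices(a_row, p_row):
    """Vertices of {q >= 0 : sum(q) = 1, sum(a_row[j] * q[j]) = p_row}."""
    J = len(a_row)
    vertices = []
    for j in range(J):  # vertices with a single nonzero column margin
        if a_row[j] == p_row:
            q = [F(0)] * J
            q[j] = F(1)
            vertices.append(q)
    for j, k in combinations(range(J), 2):  # vertices with two nonzero margins
        if a_row[j] == a_row[k]:
            continue
        qj = (p_row - a_row[k]) / (a_row[j] - a_row[k])
        if 0 <= qj <= 1:
            q = [F(0)] * J
            q[j], q[k] = qj, 1 - qj
            vertices.append(q)
    return vertices


def cell_bounds(a_row, p_row, j):
    """(lower, upper) bound on p_1j = a_1j * q_j over all vertices."""
    values = [a_row[j] * q[j] for q in column_margin_vertices(a_row, p_row)]
    return min(values), max(values)


def upper_bound_closed_form(a_row, p_row, j):
    """Closed-form upper bound, as read from Theorem 3 (an assumption)."""
    others = [a for k, a in enumerate(a_row) if k != j]
    ref = max(others) if p_row >= a_row[j] else min(others)
    return a_row[j] * (p_row - ref) / (a_row[j] - ref)


# Hypothetical first row of A for a 2 x 3 table and the margin p_1.
a_row = [F(3, 4), F(1, 3), F(1, 2)]
p_row = F(1, 2)
bounds = [cell_bounds(a_row, p_row, j) for j in range(3)]
```

In this example the enumerated upper bounds agree with the closed form for all three cells of the first row.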

Given these constraints and the LP approach, in the limit these are the 
sharpest bounds we can obtain. For inferential purposes this may be sufficient. 
But these bounds fail to preserve two conditions that hold when none of the
conditional probability values is zero:

1. None of the individual cell probabilities (or counts) can be zero, and

2. Cross-product ratios of 2 × 2 subtables cannot be zero or infinite.

Therefore, $0 \le p_{ij} \le a_{ij}$ for f(x|y) does not give sharp bounds for the table
of counts. Without some other prior information relevant to the observed data, 
however, we cannot do better using linear programming. The example in Section 
4.3 demonstrates how these bounds can lead to a false sense of data security. 



Known Sample Size N. When we know the sample size N, we can obtain
tighter bounds for the cell probabilities, and hence for cell counts, but we are
also likely to get fractional bounds by applying linear programming methods.
One way to obtain tighter bounds on cell probabilities is to ensure nonzero values
of the entries, e.g., by modifying the constraints in equations (7) and (12) to
become $p_{ij} \ge 1/N$. However, even this approach is not always sufficient to
produce the sharpest bounds on the tables of counts and is likely to lead to
fractional bounds on the tables with integer entries as well.

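When we know N, we can also apply integer programming (IP):

$$\text{Max } n_{ij}, \tag{16}$$

subject to

$$\sum_i \sum_j n_{ij} = N, \tag{17}$$

$$(1 - a_{ij})\,n_{ij} - a_{ij} \sum_{k \ne i} n_{kj} = 0, \qquad \sum_i a_{ij} = 1, \tag{18}$$

$$\text{and } n_{ij} \ge 1, \quad \forall\, i, j. \tag{19}$$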



IP methods such as branch-and-bound rely on implicit enumeration of possible
solutions and will either give the sharpest bounds for the cell counts or will not 
find a feasible solution. For larger tables, however, IP methods can be computa- 
tionally prohibitive. 

Example 2. Suppose that the only information we have about the data from 
Table 1 is that P{Download\Gender) is given by the matrix B. The following 
IP model will give the sharp bounds (see Table 2) for the integer entries of the 
table. 



$$\text{Max } n_{ij}, \tag{20}$$

subject to

$$n_{11} + n_{12} + n_{21} + n_{22} = 50, \tag{21}$$

$$0.4\,n_{11} - 0.6\,n_{12} = 0, \tag{22}$$

$$0.8\,n_{21} - 0.2\,n_{22} = 0, \tag{23}$$

$$\text{and } n_{ij} \ge 1. \tag{24}$$



Table 2. Data from Table 1 and IP bounds given P(Download|Gender)

          Download Yes   Download No
Male      15 [3, 27]     10 [2, 18]
Female     5 [1, 9]      20 [4, 36]
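
For a table this small, the IP bounds of Example 2 can be confirmed by brute-force enumeration instead of branch-and-bound. The sketch below scales the conditional constraints by 10 so that they can be tested in exact integer arithmetic.

```python
N = 50
feasible = []
for n11 in range(1, N):
    for n12 in range(1, N - n11):
        for n21 in range(1, N - n11 - n12):
            n22 = N - n11 - n12 - n21  # at least 1 by the loop bounds
            # 0.4*n11 - 0.6*n12 = 0 and 0.8*n21 - 0.2*n22 = 0, scaled by 10.
            if 4 * n11 == 6 * n12 and 8 * n21 == 2 * n22:
                feasible.append((n11, n12, n21, n22))

# Bounds on (n11, n12, n21, n22) over all feasible integer tables.
bounds = [(min(t[c] for t in feasible), max(t[c] for t in feasible))
          for c in range(4)]
```

The enumeration finds nine feasible tables, including the original one, and reproduces the bounds in Table 2.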



However, conditional frequencies are often reported as floating point approx- 
imations. This rounding typically leads to infeasible IP solutions, and we need to 
apply linear programming relaxation which in turn may again give us fractional 
bounds that are not tight, as the following example illustrates. 

Example 3. Suppose that the only knowledge we have about the original Table 1
consists of the conditional frequencies

$$P(\text{Gender} \mid \text{Download}) = (a_{ij}) = A = \begin{pmatrix} 0.75 & 0.33 \\ 0.25 & 0.67 \end{pmatrix}.$$

The left panel in Table 3 gives the LP fractional bounds for this problem.
There is a significant discrepancy between these upper and lower bounds for
some of the cells when compared with the sharp bounds provided in the right
panel of the same table. Moreover, errors due to rounding of the reported
conditional frequencies would lead to a very different set of bounds. The issue of
rounding in reporting statistical data has received very limited attention in
the disclosure literature and we are just starting to explore its effects on bounds.

Table 3. Data from Table 1 with LP relaxation bounds given P(Gender|Download)
in the left panel and the sharp bounds obtained via Markov basis in the right panel.

          LP relaxation bounds               Sharp bounds
          Download Yes    Download No        Download Yes   Download No
Male      15 [3, 35.25]   10 [1, 15.33]      [6, 33]        [2, 14]
Female     5 [1, 11.74]   20 [2, 30.67]      [2, 11]        [4, 28]



4.2 Markov Bases and Bounds 

We can find feasible solutions to the constrained maximization problem by using
tools from computational algebra, such as Gröbner or Markov bases. A minimal
Markov basis (a set of moves) allows us to build a connected Markov chain
and perform a random walk over the space of tables of counts that have the
same fixed marginals and/or conditionals. A technical description of the calculation
and structure of Markov bases given fixed conditional distributions for two-way
tables can be found in Slavković ([22]).

Consider the information provided in Example 3. The minimal Markov basis
is represented by the following binomial:
$n_{11}^{9} n_{21}^{3} - n_{12}^{4} n_{22}^{8}$. This implies two
possible moves on our 2 × 2 table of counts. These moves together with the
sample size N describe the space of tables containing only four possible tables 
with non-negative integer counts (see Table 4). This procedure not only gives 
the sharp bounds listed in right panel of Table 3, but also provides more insight 
into the structure of the table. A number of possible table realizations could also 
be used as a measure of disclosure risk. In this case, the space of tables satisfying 
the constraints is too small and would easily lead to a full disclosure of any of 
the cells, and the whole table. These observations hold for a higher dimensional 
tables as well and could have implications on the assessment of disclosure risk. 
In particular, they suggest that reporting conditional distributions for a subset 
of variables may be essentially equivalent from the bounds perspective to the 
reporting of the corresponding marginal. Although we need to investigate this 
issue further it may be welcome news since the calculation of Markov bases is 
typically computationally very expensive for relatively large multi-way tables. 
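
The space of tables for Example 3 is small enough to be found by direct
enumeration. The following Python sketch is only an illustration, not the
computational-algebra procedure described above. It assumes that the exact
conditionals behind the reported values are P(Male | Yes) = 3/4 and
P(Male | No) = 1/3 (the reported 0.33 and 0.67 being rounded) and that N = 50;
the function name is ours.

```python
from fractions import Fraction

def tables_given_conditionals(n, p_male_given_yes, p_male_given_no):
    """List all 2x2 tables (rows: Male, Female; columns: Yes, No) with grand
    total n whose column-conditional frequencies match the given values exactly."""
    tables = []
    for yes_total in range(1, n):            # both column totals must be positive
        no_total = n - yes_total
        male_yes = p_male_given_yes * yes_total
        male_no = p_male_given_no * no_total
        # keep only the splits that give integer counts in every cell
        if male_yes.denominator == 1 and male_no.denominator == 1:
            my, mn = int(male_yes), int(male_no)
            tables.append(((my, mn), (yes_total - my, no_total - mn)))
    return tables

# Assumed exact conditionals: P(Male | Yes) = 3/4 and P(Male | No) = 1/3.
tables = tables_given_conditionals(50, Fraction(3, 4), Fraction(1, 3))

# The sharp bounds of a cell are its minimum and maximum over the feasible tables.
sharp_bounds = {
    (r, c): (min(t[r][c] for t in tables), max(t[r][c] for t in tables))
    for r in range(2) for c in range(2)
}
```

Under these assumptions the enumeration yields four tables, and the minimum and
maximum of each cell over them agree with the sharp bounds in the right panel
of Table 3.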

Table 4. The space of contingency tables with non-negative integer entries given fixed
conditional distribution P(Gender | Download)

          Yes    No   Total              Yes    No   Total
Male        6    14      20    Male       24     6      30
Female      2    28      30    Female      8    12      20
Total       8    42      50    Total      32    18      50

          Yes    No   Total              Yes    No   Total
Male       15    10      25    Male       33     2      35
Female      5    20      25    Female     11     4      15
Total      20    30      50    Total      44     6      50

4.3 Example: Delinquent Children Data

In this section we consider a 4 x 4 table of counts used in the Disclosure
Limitation Methodology Report of the Federal Committee on Statistical
Methodology ([12]). Titles, row and column headings are fictitious. Table 5
shows the number of delinquent children by county and education level of
household head. We compare the effect on disclosure of releasing margins versus
conditionals. First, suppose we release both margins (row sums and column
sums); the Fréchet bounds for the counts in this case are given in the left
panel of Table 5. Observe that the cells with small counts (i.e., sensitive
cells) are well protected. For example, cell [Alpha, Medium] with count “1” is
bounded below by 0 and above by 20. Utilizing tools from computational algebra
and Markov bases, one can determine that there are 18,272,363,056 possible
tables given the fixed margins.
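
The Fréchet bounds quoted above follow directly from the two margins. As a
minimal sketch in Python (the function name is ours; the counts are those of
Table 5), each cell is bounded below by max(0, row sum + column sum - N) and
above by min(row sum, column sum):

```python
# Cell counts of Table 5 (rows: Alpha, Beta, Gamma, Delta;
# columns: Low, Medium, High, Very High).
counts = [
    [15, 1, 3, 1],
    [20, 10, 10, 15],
    [3, 10, 10, 2],
    [12, 14, 7, 2],
]

def frechet_bounds(table):
    """Return, for every cell, the (lower, upper) Frechet bounds implied by
    the row sums and column sums of a two-way table of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    return [[(max(0, r + c - n), min(r, c)) for c in col_sums] for r in row_sums]

bounds = frechet_bounds(counts)
```

For instance, `bounds[0][1]` is the pair for cell [Alpha, Medium].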



Table 5. Delinquent children data by county and education level. The left panel
contains the cell counts and the Fréchet bounds given the margins. The right panel
contains the LP relaxation bounds given P(Education | County).

Left panel: cell counts with Fréchet bounds

County    Low          Medium       High         Very High
Alpha     15 [0,20]     1 [0,20]     3 [0,20]     1 [0,20]
Beta      20 [0,50]    10 [0,35]    10 [0,30]    15 [0,20]
Gamma      3 [0,25]    10 [0,25]    10 [0,25]     2 [0,20]
Delta     12 [0,35]    14 [0,35]     7 [0,30]     2 [0,20]

Right panel: LP relaxation bounds

County    Low             Medium          High            Very High
Alpha     [15, 74.6]      [1, 4.97]       [3, 14.9]       [1, 4.97]
Beta      [1.99, 30.8]    [1, 15.5]       [1, 15.5]       [1.5, 23.2]
Gamma     [1.5, 11]       [5, 36.8]       [5, 36.8]       [1, 7.36]
Delta     [6.02, 33.27]   [7.02, 38.8]    [3.51, 19.4]    [1, 5.53]



Next, suppose that instead of margins we only release conditional frequencies:

\[ P(\text{Education} \mid \text{County}) =
   \begin{pmatrix}
   0.750 & 0.050 & 0.150 & 0.050 \\
   0.364 & 0.182 & 0.182 & 0.272 \\
   0.120 & 0.400 & 0.400 & 0.080 \\
   0.343 & 0.400 & 0.200 & 0.057
   \end{pmatrix} \]



Integer programming has no feasible solution for this problem, and we give
the linear programming relaxation bounds in the right panel of Table 5. The
bounds are significantly tighter than those associated with releasing the margins,
indicating higher disclosure risk. An even more surprising result comes from
calculating Markov bases and using the tools from computational algebra to
determine the space of tables. In this case, there is only one table with
non-negative integer entries satisfying the given conditional and the sample size.
Hence we have full disclosure of the counts, which was masked by the bounds
obtained from linear and integer programming. In this case, and in most two-way
tables with sensitive cells, it is not safe to release the conditionals as they
carry almost full information about the table itself.



5 Conclusions 

To date, statistical disclosure limitation methodologies for tables of counts
have been heavily intertwined with the release of unaltered marginal totals from
such tables. Methods have focused in part on the inferences that an intruder can
draw from such releases, as well as on the use of marginals for inferences
about underlying relationships among the variables that make up the table, e.g.,
based on log-linear models. Many statistical agencies also release other forms of
summary data from tables, such as tables of rates or observed conditional relative
frequencies. These are predominantly released as two-way and three-way tables,
with conditioning on a single variable. Preserving conditionals and marginals
from a table puts us into a constrained subset of the probability simplex for
the space of possible tables. It is only with the knowledge of the sample size N
(or a 1-way marginal in addition to a set of conditionals) that we can calculate
the bounds on the actual counts. For two-way tables this often leads to full
disclosure, i.e., the complete specification of the original table. Zero cells in a
table become zeros again when we condition on one or more of the variables,
and thus the release of such conditionals reveals extra information about the full
cross-classification.

We are just beginning to understand the implications of rounding in the 
construction of conditionals on disclosure limitation. But the work we have done 
to date suggests that in two dimensions the impact of rounding on identification 
of feasible tables is likely to increase dramatically the disclosure of sensitive cells 
in the original table. 

Finally, we note that the work reported here is illustrative of the kinds of
statistical calculations that are possible in higher-way tables with the release of
some combination of marginals and conditionals. In some cases the released data
reduce to a set of marginals and the results of Dobra and Fienberg (2000, 2002)
can then be used directly. In other cases, the release of a conditional instead
of a marginal can yield larger bounds and looser inferences about the cells in
the table by an intruder. We hope to report on extensions of the methodology
introduced here to the multi-way case in the near future.


Abstract. Rounding methods are perturbation techniques used by statistical
agencies for protecting privacy when publishing aggregated data such as
frequency tables. Controlled rounding is one of these methods that, besides
making it possible to publish clearer rounded tables that still respect the
additivity underlying the original table, also offers an effective and
powerful way to protect the unsafe data. Due to the lack of an automatic
procedure, it is at present still not used much, and statistical agencies
apply less sophisticated variants (e.g., random rounding) which have several
disadvantages. This work presents an automatic procedure to apply controlled
rounding in a simple and friendly way on a personal computer. The tool is
written in the C programming language, using the Excel SDK-API, and is
implemented as an add-in for Microsoft Excel, a standard software package we
chose because it is widespread on personal computers and familiar to many
users who work with tabular material in statistical agencies. The algorithm
is based on solving an integer linear program in order to find a “good”
rounded table while ensuring additivity and protection, as recently proposed
by the first author. An extra feature of the implementation presented here is
that it does not need to be linked with commercial mathematical programming
software; better performance on large-scale tables would be obtained when
Xpress or Cplex is available on the computer.



1 Introduction to Controlled Rounding Methodology 

Before publishing data, a statistical office should apply a methodology to
guarantee confidentiality (see, e.g., Willenborg and de Waal ([1996])). Among
several possibilities, there is a family of methodologies named Roundings that
consist of slightly modifying the original cell values before publication. The
aim is that the final output should leave a possible intruder with a certain
uncertainty about the unsafe cell values in the original table. More precisely,
the ideal situation is to publish a table such that no original value can be
computed to within an accuracy finer than some a priori protection levels.
Clearly, uncertainty is related to loss of information, and therefore this aim
also includes the wish to control the individual and the overall perturbation
added to the original table. In addition, statistical agencies would prefer to
publish an output which is also additive (i.e., another table with the same
mathematical structure as the original one) and which is as close as possible
to the original table.

Since the above definition of “rounding” leads to a complex optimization
problem in practice, mainly due to the protection and additivity requirements,
several relaxed methods are used instead. Most of these methods drop the
additivity requirement (which is a difficult condition on large linked tables)
and even the a priori protection levels (which are checked a posteriori by
solving the so-called auditing phase). The best known of these relaxed methods
are deterministic rounding and random rounding. Deterministic Rounding
consists of replacing each original cell value by its nearest multiple of the
rounding base. It is a very simple algorithm for finding output tables, with
the important disadvantage that it is very biased. For example, assuming that
the rounding base is 5, a cell containing value 15 in the output cannot have
(e.g.) either 11 or 12 as its original value. Random Rounding partially solves
this disadvantage of deterministic rounding by introducing probabilities when
rounding a cell value up or down. More precisely, if $\beta$ is the given
rounding base, $a_i$ is the original cell value, $\lfloor a_i \rfloor$ is the
largest multiple of $\beta$ not larger than $a_i$, and $\lceil a_i \rceil$ is
the smallest multiple of $\beta$ not smaller than $a_i$, then the random
rounding procedure will replace $a_i$ by $\lceil a_i \rceil$ with probability
$p_i := (a_i - \lfloor a_i \rfloor)/\beta$, and by $\lfloor a_i \rfloor$ with
probability $1 - p_i$. See Figure 1 for an example of output tables using these
methodologies.

Unfortunately, none of these relaxed methods takes additivity into account,
and indeed the output tends to contain marginal cells which are not consistent
with the sum of the associated internal cells. Also, the protection is assumed
to be provided implicitly by the perturbation process but, since it is not
guaranteed, the protection of the rounded table should be explicitly checked
by solving the so-called auditing phase before release. This problem involves
the computation of the worst-case limits for each unsafe cell, and it is quite a
time-consuming task. These behaviours are undesirable, according to several
experts in statistical agencies, and they should be avoided. At present the main
reason why many statistical offices still publish this type of undesirable table
is the lack of an automatic procedure that perturbs the original data whilst
keeping additivity. The main contribution of this article is to close this gap
by presenting a simple automatic tool that helps a user to achieve this aim on
a personal computer. Preliminary experiments on real-world data (see Appendix)
show that the underlying tool is able to find control-rounded tables when the
original table consists of 220,040 cells and 75,572 equations, and of 431,613
cells and 96,720 equations, within seconds on a standard personal computer.

This article is organized as follows. Section 2 introduces the required notation
and provides the main features of the controlled rounding methodology. Section
3 summarizes the mathematical background of the automatic procedure, which
is briefly described and illustrated in Section 4. The article ends with some
remarks.

(a) Original Table

               Region
               A      B      C     total
activity 1     23     48     10      81
activity 2      8     19     22      49
activity 3     17     33     12      62
total          48    100     44     192

(b) Nearest integer multiple rounded

               Region
               A      B      C     total
activity 1     25     50     10      80
activity 2     10     20     20      50
activity 3     15     35     10      60
total          50    100     45     190

(c) Randomly rounded

               Region
               A      B      C     total
activity 1     20     50     10      80
activity 2      5     15     25      50
activity 3     15     35     10      60
total          45    100     45     195

(d) Rounded with controlled rounding

               Region
               A      B      C     total
activity 1     25     45     10      80
activity 2     10     20     20      50
activity 3     20     35     15      60
total          45    100     45     190

Fig. 1. 2-dimensional table of data for three regions and three activities with marginals.



2 Controlled Rounding 

Let us denote the table we want to round by a vector $a := [a_i : i \in I]$,
where $a_i$ ($i \in I$) represents the cell with index $i$ in the table, and $I$
the set of all indices. Denoting by My = d the set of equations linking the cell
numbers of each table, clearly $a$ is a solution of this linear system. Note that
M and d are given by the structure of the table, independently of the numbers
$a_i$, and typically each row of M is a vector of coefficients, all of them being
+1 except one (the one associated with a marginal cell), which is -1; vector d is
typically a zero vector. For example, table (a) in Figure 1 consists of 16
numbers (one for each cell) and 8 equations (one for each column and each row).
When applying rounding, a base number must also be given and it is denoted here
by $\beta$. Given a value $a_i$, for convenience of notation, we will denote by
$\lfloor a_i \rfloor$ the largest multiple of $\beta$ smaller than or equal to
$a_i$, and by $\lceil a_i \rceil$ the smallest multiple of $\beta$ bigger than or
equal to $a_i$. Rounding a table $a$ to a base $\beta$ means in the strict sense
that we replace (and therefore perturb) the values of the cells by either
$\lfloor a_i \rfloor$ or $\lceil a_i \rceil$. The rounded table will be denoted
by a vector $\hat{a}$ whose elements satisfy
$\hat{a}_i \in \{\lfloor a_i \rfloor, \lceil a_i \rceil\}$ for all $i \in I$.

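When applying the deterministic rounding procedure, the rounded table is
simply defined by:

\[ \hat{a}_i := \begin{cases}
   \lceil a_i \rceil & \text{if } 2a_i > \lfloor a_i \rfloor + \lceil a_i \rceil \\
   \lfloor a_i \rfloor & \text{otherwise}
   \end{cases} \qquad \text{for all } i \in I, \tag{1} \]

while the rounded table by using random rounding is determined by:

\[ \hat{a}_i := \begin{cases}
   \lceil a_i \rceil & \text{with probability } p_i := (a_i - \lfloor a_i \rfloor)/\beta \\
   \lfloor a_i \rfloor & \text{with probability } 1 - p_i
   \end{cases} \qquad \text{for all } i \in I. \tag{2} \]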
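
As an illustration of rules (1) and (2), the two procedures can be sketched in
a few lines of Python. This is not the implementation of the tool described in
this article (which is written in C); the function names are ours, and
non-negative integer cell values are assumed.

```python
import random

def floor_multiple(a, beta):
    """Largest multiple of beta that is smaller than or equal to a."""
    return (a // beta) * beta

def ceil_multiple(a, beta):
    """Smallest multiple of beta that is bigger than or equal to a."""
    return -((-a) // beta) * beta

def deterministic_round(a, beta):
    """Rule (1): round up only when a lies strictly above the midpoint."""
    lo, hi = floor_multiple(a, beta), ceil_multiple(a, beta)
    return hi if 2 * a > lo + hi else lo

def random_round(a, beta, rng=random):
    """Rule (2): round up with probability p = (a - floor) / beta."""
    lo, hi = floor_multiple(a, beta), ceil_multiple(a, beta)
    p = (a - lo) / beta
    return hi if rng.random() < p else lo
```

With base 5, `deterministic_round` maps the first row of table (a) in Figure 1
(23, 48, 10, 81) to the first row of table (b). A value that is already a
multiple of the base has p = 0 and is left unchanged by `random_round`.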

As mentioned, a bad feature of these two simple procedures is that the output
$\hat{a}$ is in general not a vector in the linear space $\{y : My = d\}$, i.e.
$\hat{a}$ is not an additive table. Controlled rounding is a more sophisticated
procedure where the decisions on the cells must lead to a new table $\hat{a}$
which is additive (i.e., it must also be a solution of the linear system of
equations My = d). A result of the controlled rounding procedure for table (a)
in Figure 1 is shown in table (d).
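
Additivity of a 2-dimensional table with marginals can be checked directly from
its row and column equations. The following Python sketch (the function name
and the list-of-rows layout are our own choices) tests the 8 equations of table
(a) in Figure 1 and of its nearest-multiple rounding, table (b):

```python
def is_additive(table):
    """table: list of rows, each ending with its row total; the last row holds
    the column totals and the grand total. Returns True when every row and
    column equation of the system My = d holds."""
    *body, totals = table
    rows_ok = all(sum(row[:-1]) == row[-1] for row in table)
    cols_ok = all(sum(row[j] for row in body) == totals[j]
                  for j in range(len(totals)))
    return rows_ok and cols_ok

# Table (a) of Figure 1 and its nearest-multiple rounding, table (b).
original = [[23, 48, 10, 81], [8, 19, 22, 49], [17, 33, 12, 62], [48, 100, 44, 192]]
nearest = [[25, 50, 10, 80], [10, 20, 20, 50], [15, 35, 10, 60], [50, 100, 45, 190]]
```

Here `is_additive(original)` holds, whereas table (b) already fails in its first
row, since 25 + 50 + 10 differs from 80.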

Clearly, finding an additive table $\hat{a}$ requires the use of a more
sophisticated algorithm, based on mathematical models to manage the constraints.
What is more, an algorithm to solve the model requires greater computational
effort than simply using (1) or (2). Still, as described in Salazar-Gonzalez
([2002b]), it is possible to implement an algorithm which has very good
performance in practice on large-scale tables. Therefore, as will be discussed
later, the complexity of the methodology is not a real disadvantage considering
modern technology in Computer Science and in Operations Research.
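
For a very small table the problem can even be solved by exhaustive search,
which makes the role of the constraints concrete. The Python sketch below is
only a toy illustration and not the integer linear programming approach of
Salazar-Gonzalez ([2002b]). It assumes a 2-dimensional table given by its
internal cells, rounds every cell (including the derived marginals) to an
adjacent multiple of the base, and keeps the additive candidate with the
smallest sum of absolute changes; the function name is hypothetical.

```python
from itertools import product

def controlled_round_bruteforce(internal, beta):
    """Exhaustive zero-restricted controlled rounding of a 2-dimensional table.
    `internal` holds the internal cells only; marginals are derived by
    summation, so every candidate is additive by construction."""
    def options(a):
        lo = (a // beta) * beta
        return (lo,) if lo == a else (lo, lo + beta)

    def with_margins(cells):
        rows = [list(r) + [sum(r)] for r in cells]
        return rows + [[sum(col) for col in zip(*rows)]]

    original = with_margins(internal)
    n_cols = len(internal[0])
    flat = [a for row in internal for a in row]
    best, best_dist = None, None
    for choice in product(*(options(a) for a in flat)):
        cells = [choice[i:i + n_cols] for i in range(0, len(choice), n_cols)]
        candidate = with_margins(cells)
        pairs = [(a, r) for row_a, row_r in zip(original, candidate)
                 for a, r in zip(row_a, row_r)]
        # the derived marginals must also be adjacent multiples of their originals
        if any(r not in options(a) for a, r in pairs):
            continue
        dist = sum(abs(a - r) for a, r in pairs)
        if best is None or dist < best_dist:
            best, best_dist = candidate, dist
    return best, best_dist

# Internal cells of table (a) in Figure 1, rounding base 5.
rounded, distance = controlled_round_bruteforce(
    [[23, 48, 10], [8, 19, 22], [17, 33, 12]], beta=5)
```

The search space doubles with every internal cell that is not a multiple of the
base, which is why a mathematical programming model is needed for tables of
realistic size.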

The underlying mathematical problem of finding a controlled rounded table
has existed for many years (see e.g. Bacharach ([1966])), and it has been
a desirable technique to protect private information in statistical agencies (see
e.g. Nargundkar and Saveland ([1972]) using random rounding, and Cox and
Ernst ([1982]) using controlled rounding). The problem was observed to be
difficult (see Kelly, Golden, Assad and Baker ([1990b])) and possibly even
infeasible when My = d is a 3-dimensional structure (see Causey, Cox and Ernst
([1985])). Since then much work has been done (see e.g. Fischetti and Salazar
([1998]) for other references). Very recently Salazar-Gonzalez ([2002a, 2002b])
has introduced a new mathematical programming model that guarantees protection
(thus solving the so-called auditing problem is unnecessary) and finds an
approximate solution when the problem is infeasible. An important advantage of
the approach is that it deals with all types of tables (including k-dimensional,
linked and hierarchical tables) and it optimizes an objective function that
allows a user to steer the selection of the solution when several exist. Part of
this algorithm is embedded in the tool proposed here. The model and the
algorithm successfully address the two inconveniences of the controlled rounding
methodology: first, the model extends the definition of the classical controlled
rounding method (also named the zero-restricted version); second, it is modelled
and implemented by using modern techniques in Integer Linear Programming, thus
a protected and good rounded table is guaranteed in short computational time
for most medium-size tables.

An interesting advantage of using controlled rounding is that the produced
output table has an overall difference from the original table (here termed
net perturbation) which is smaller than when using other relaxed (and simpler)
rounding variants. By using deterministic rounding each cell can be perturbed
by about $\beta/2$ units in the worst case. By using random rounding, the
worst-case analysis shows that $\beta - 1$ units of perturbation are possible on
a cell. Since all the rounding decisions are made independently over all the
cells, we not only have a high probability of losing additivity but also a high
probability of inserting too much perturbation into the output table. In fact,
when the rounding methodology consists of a simple decision rule applied to each
cell (as in deterministic and random rounding), the net perturbation introduced
can be of an order of magnitude similar to the number of cells times $\beta$.
This is not the case when using controlled rounding because the equations
guarantee a much smaller perturbation. For example, using controlled rounding,
not all the cells in a non-trivial equation can be rounded up without losing
additivity. To be more precise, we define

\[ \rho(a, \hat{a}) := \sum_{i \in I} (a_i - \hat{a}_i), \tag{3} \]

the net perturbation between the rounded table $\hat{a}$ and the original table
$a$. Then the following result shows that, when using controlled rounding, the
net perturbation depends directly on the dimension of the table and on the
rounding base, but is independent of the number of cells of the table. The same
good result does not hold when using random or deterministic rounding. An
important observation to simplify the proof of the result is that all tables
(including linked and hierarchical ones) can be seen as part of a $k$-dimensional
table where each equation is associated with one marginal cell.

Lemma 1. Let $a = [a_i : i \in I]$ be a $k$-dimensional table, and let $a_0$ be
the grand total value. Then $\sum_{i \in I} a_i = 2^k a_0$.

Proof. It is based on induction over the dimension $k$. The result is immediate
for $k = 1$. Then, to do the induction step, we observe that a $k$-dimensional
table consists of adding a set of $\nu$ subtables to define another subtable, all
these subtables with dimension $k - 1$. Let $I_\mu$ be the set of cells of each
subtable ($\mu \in \{0, 1, \ldots, \nu\}$), where $I_0$ represents the cells in
the marginal subtable and $I_1, \ldots, I_\nu$ the internal subtables. Value
$a_{0_\mu}$ is the grand total of the $\mu$th subtable. Applying the hypothesis:

\[ \sum_{i \in I} a_i
   = \sum_{i \in I_0} a_i + \sum_{\mu=1}^{\nu} \sum_{i \in I_\mu} a_i
   = 2^{k-1} a_{0_0} + 2^{k-1} \sum_{\mu=1}^{\nu} a_{0_\mu}
   = 2^{k-1} a_{0_0} + 2^{k-1} a_{0_0}
   = 2^k a_{0_0} = 2^k a_0. \qquad \Box \]

Corollary 1. Let $a = [a_i : i \in I]$ be a $k$-dimensional table, and let $a_0$
be the grand total value. The net perturbation of a controlled rounding solution
$\hat{a}$ is given by:

\[ \rho(a, \hat{a}) = 2^k (a_0 - \hat{a}_0). \]

It is important not to confuse (3) with a function to measure the distance
between the original table $a$ and the rounded table $\hat{a}$, which in many
practical cases is:

\[ \delta(a, \hat{a}) := \sum_{i \in I} |a_i - \hat{a}_i|. \tag{4} \]
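
The difference between the net perturbation (3) and the distance (4), and the
statement of Corollary 1, can be checked numerically on table (a) of Figure 1
(k = 2, base 5). In the Python sketch below, `additive_rounded` is a
hypothetical additive rounding that we constructed for illustration (it is not
one of the panels of Figure 1), and `nearest` is the nearest-multiple rounding
of panel (b); the function names are ours.

```python
def net_perturbation(a, a_hat):
    """Net perturbation (3): signed sum of the cell differences."""
    return sum(x - y for ra, rb in zip(a, a_hat) for x, y in zip(ra, rb))

def distance(a, a_hat):
    """Distance (4): sum of the absolute cell differences."""
    return sum(abs(x - y) for ra, rb in zip(a, a_hat) for x, y in zip(ra, rb))

original = [[23, 48, 10, 81], [8, 19, 22, 49], [17, 33, 12, 62], [48, 100, 44, 192]]
# Hypothetical additive rounding to base 5, constructed for this example.
additive_rounded = [[20, 50, 10, 80], [10, 15, 25, 50], [15, 35, 10, 60], [45, 100, 45, 190]]
# Nearest-multiple rounding, panel (b) of Figure 1 (not additive).
nearest = [[25, 50, 10, 80], [10, 20, 20, 50], [15, 35, 10, 60], [50, 100, 45, 190]]

k = 2
sum_of_all_cells = sum(sum(row) for row in original)   # Lemma 1: equals 2**k * 192
predicted = 2 ** k * (original[-1][-1] - additive_rounded[-1][-1])
```

For the additive table, `net_perturbation(original, additive_rounded)` coincides
with `predicted`, as Corollary 1 states, while for the non-additive table (b)
the identity fails even though its grand total is also 190. The distance (4) of
the additive table is larger than its net perturbation because positive and
negative changes no longer cancel.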

Comparing the rounded tables obtained from the three methodologies (random,
deterministic and controlled rounding), the one with minimum distance
$\delta(a, \hat{a})$ is clearly the deterministic rounding. Random rounding,
instead, tends to produce solutions which are much further from the original
table. Still, this is not a disadvantage of random rounding when compared to
deterministic rounding, since as compensation the random rounding solutions are
unbiased, as described next.

Briefly, the bias of a methodology is the probability that a given rounded
value has been obtained from a given original value. The best case occurs when
a value goes to $\lceil a_i \rceil$ with probability $p_i$, as defined in (2),
and the methodology is then said to be unbiased. Clearly, neither deterministic
rounding nor controlled rounding (with the standard settings) is unbiased. Cox
([1987]) gives a constructive algorithm for achieving unbiased controlled
rounding for 2-dimensional tables, but other approaches for more complex tables
did not exist until the proposal of the mathematical model introduced by
Salazar-Gonzalez ([2002b]), which is the one used here. Indeed, this proposal
gives the user additional input parameters that can control the reduction of
bias through controlled rounding. More precisely, the objective function
minimized is transformed into a more general function

\[ \sum_{i \in I} w_i x_i \tag{5} \]

where $x_i$ is a decision variable that assumes value 0 if $a_i$ should be
rounded down or 1 if $a_i$ should be rounded up, and where $w_i$ is a weight
introduced by the user to the optimization problem. In this problem, a value for
each variable $x_i$ must be found such that the rounded table is additive and
protected. The input parameters $w_i$ are very important to steer the selected
solution among all the feasible ones. Indeed, by considering

\[ w_i := \lfloor a_i \rfloor + \lceil a_i \rceil - 2a_i \]

then minimizing (5) is equivalent to minimizing (4). Notice that $w_i$ is the
relative cost (if positive) or profit (if negative) of rounding up ($x_i = 1$)
with respect to rounding down ($x_i = 0$) a given cell value $a_i$.
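
The equivalence between (4) and (5) under these weights rests on a simple
identity: the distance of a cell equals $a_i - \lfloor a_i \rfloor$ when it is
rounded down, and that amount plus $w_i$ when it is rounded up. The following
Python sketch checks this identity on a few cells; the function names are ours,
and non-negative integer values are assumed.

```python
def weight(a, beta):
    """w = floor + ceil - 2a: cost of rounding up relative to rounding down."""
    lo = (a // beta) * beta
    hi = lo if lo == a else lo + beta
    return lo + hi - 2 * a

def distance_via_weights(cells, rounded, beta):
    """Evaluate the distance (4) as a constant plus the weighted sum (5)."""
    constant = sum(a - (a // beta) * beta for a in cells)
    # x = 1 when the cell was rounded up, 0 when it was rounded down
    x = [1 if r > (a // beta) * beta else 0 for a, r in zip(cells, rounded)]
    return constant + sum(weight(a, beta) * xi for a, xi in zip(cells, x))

cells = [23, 48, 10, 81]
rounded = [20, 50, 10, 80]
direct = sum(abs(a - r) for a, r in zip(cells, rounded))
```

Since the constant term does not depend on the rounding decisions, minimizing
the weighted sum alone selects the same table as minimizing the distance.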

Other settings of $w_i$ are also possible, and indeed if another table $a'$ is
given then it is possible to use the mathematical model and algorithm proposed
in Salazar-Gonzalez ([2002b]) to minimize the distance $\delta(a', \hat{a})$
between the rounded table and the desired table. To illustrate the utility of
this option, observe that the random rounding solution $a'$ can be easily found
and has the advantage of being unbiased, but it has the big disadvantage of
being non-additive. Then it makes sense to look for a rounded table $\hat{a}$
which is additive and as close as possible to the random rounded solution $a'$.
This can be done immediately by simply generating the associated weights

\[ w_i := \begin{cases}
   \lfloor a_i \rfloor + \lceil a_i \rceil - 2a_i & \text{with probability } p_i \\
   2a_i - \lfloor a_i \rfloor - \lceil a_i \rceil & \text{with probability } 1 - p_i
   \end{cases} \qquad \text{for all } i \in I, \]

and by applying the optimization algorithm.

Clearly, other criteria to set up a desirable objective function (5) are also
possible. In all cases, it is important to note that the final rounded table
generated by the algorithm will be additive and protected, and the criterion
defining the weights steers the optimization when selecting the solution among
all the protected and additive tables. For example, the above random way of
defining $w_i$ will drive the optimization algorithm to find the controlled
rounded table closest to the random rounded table, and so possibly also with
the advantage of being unbiased. Indeed, the standard controlled rounding with
the classical objective function of minimizing (4) looks for $\hat{a}$ as close
as possible to the original table $a$, exactly as deterministic rounding does,
and thus the generated table is biased. By setting up the input parameters $w_i$
(possibly in a random manner) it is possible to steer the controlled rounding
algorithm to compute controlled rounded tables closer to other (possibly
non-additive) tables with good statistical properties. In this way, the computed
controlled rounded tables will be not only additive and protected, but will also
have good statistical properties.

Another important advantage of the algorithm implemented in this work is
that it allows all types of table structures to be introduced. Indeed, the
dependencies do not have to be limited to sums of internal cells adding to a
marginal cell. The model and the algorithm can work with all types of linear
dependencies between cells, as for example $a_1 + 5a_2 = 2a_3 + 8a_4 + a_5$. The
only requirement is that they must be linear equations linking cell values of
the table. By adding these constraints to the description of the table, the
result of the algorithm will also respect these more general dependencies or
linkages between cells. More precisely, as becomes clear in the next section,
our implementation is able to process all kinds of Excel sum formulae that can
be introduced into a cell. These are formulas that do not contain any
coefficients, such as =SUM(A1:A10;B2;C3;'Another Sheet'!A1).



(a) Original table

    1    7  |   8
    4    4  |   8
    5   11  |  16

(b) Rounded table

    0    6  |   9
    3    3  |   9
    6   12  |  15

(c) Recovering the original

    [panel showing the enumeration of candidate original values]

Fig. 2. Checking the protection of a random rounded 2-dimensional table with $\beta$ = 3.

Last but not least, it is fundamental to notice that the model and algorithm
presented in Salazar-Gonzalez ([2002b]) can implicitly guarantee protection
levels while searching for a controlled rounded table. This is another very
important advantage that is not available when random or deterministic rounding
is applied. Consider, for example, table (a) and the deterministic rounding
table (b) when $\beta = 3$ in Figure 2. By simply enumerating the possibilities,
as pointed out in (c), an intruder can derive (a) from (b). See e.g. Armitage
([2003]) for other weaknesses when checking the protection of random rounding
tables. To require pre-specified protection for the controlled rounded table one
must define the set of unsafe cells and give upper, lower and protection-level
requirements for each one. The mathematical model and computer algorithm used in
our implementation also allow us to find a rounded table which is protected
against some additional a priori information termed external bounds. The meaning
of these input parameters is extensively defined in many articles in the
literature (see, e.g., Salazar-Gonzalez ([2002a])), and will not be repeated
here.
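
The enumeration attack on the table of Figure 2 can be sketched as follows.
This Python illustration assumes that the intruder knows the table is additive,
that the original values are non-negative integers, and that nearest-multiple
(deterministic) rounding with base 3 was applied; the function names are ours.

```python
from itertools import product

BETA = 3

def det_round(a, beta=BETA):
    """Nearest-multiple rounding, following rule (1)."""
    lo = (a // beta) * beta
    hi = lo if lo == a else lo + beta
    return hi if 2 * a > lo + hi else lo

def preimages(v, beta=BETA):
    """Non-negative integers that deterministic rounding maps to v."""
    return [a for a in range(max(0, v - beta), v + beta + 1)
            if det_round(a, beta) == v]

def consistent_originals(rounded):
    """rounded: 2x2 internal cells with marginals, as a 3x3 list of rows.
    Returns every additive table whose cell-wise rounding equals `rounded`."""
    candidates = (preimages(rounded[i][j]) for i in range(2) for j in range(2))
    solutions = []
    for x11, x12, x21, x22 in product(*candidates):
        table = [[x11, x12, x11 + x12],
                 [x21, x22, x21 + x22],
                 [x11 + x21, x12 + x22, x11 + x12 + x21 + x22]]
        if all(det_round(table[i][j]) == rounded[i][j]
               for i in range(3) for j in range(3)):
            solutions.append(table)
    return solutions

rounded_b = [[0, 6, 9], [3, 3, 9], [6, 12, 15]]   # panel (b) of Figure 2
solutions = consistent_originals(rounded_b)
```

Under these assumptions the search returns a single table, namely panel (a), so
the rounded table offers no protection at all.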

Finally, a feature of fundamental importance in the mathematical formulation
of the controlled rounding problem is the way infeasible problems are dealt
with. Until now we have assumed that the controlled rounding procedure will not
modify the original cell values, i.e. $\hat{a}_i = a_i$, when they are multiples
of the rounding base, and that the other cell values have two options,
$\hat{a}_i \in \{\lfloor a_i \rfloor, \lceil a_i \rceil\}$. This version is
termed zero-restricted controlled rounding, and it has the disadvantage that a
solution may not exist, as noticed by Causey, Cox and Ernst ([1985]) for
3-dimensional tables. If no a priori protection levels and external bounds are
required then they also demonstrated that the problem always admits a solution,
but when protection levels and/or external bounds are imposed on the problem
then it can become infeasible. To avoid this situation Salazar-Gonzalez
([2002b]) proposed an extended variant where each cell variable has more than
two options. Indeed, the algorithm for solving the extended model asks the user
for a new parameter that defines the number of options allowed for each original
cell value $a_i$. To simplify the input of the algorithm, the current
implementation asks for only one input parameter $\gamma$, which is used to
limit all the options associated with a cell value $a_i$ in the following way:

\[ \hat{a}_i \in \{\lfloor a_i \rfloor - \gamma\beta,\ \lfloor a_i \rfloor - (\gamma - 1)\beta,\ \ldots,\
   \lceil a_i \rceil + (\gamma - 1)\beta,\ \lceil a_i \rceil + \gamma\beta\}, \]

if $\gamma > 0$. For $\gamma = 0$ we get exactly the zero-restricted version of
controlled rounding. These additional options are also associated with costs in
the objective function (4), which the model also converts into the objective
function (5) by adding new integer variables $x_i^+$ and $x_i^-$. The final
rounded table, if any, is then defined by

\[ \hat{a}_i := \lfloor a_i \rfloor + \beta x_i + \beta x_i^+ - \beta x_i^-. \]
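
To make the role of $\gamma$ concrete, the following Python sketch lists the
values allowed for a single cell. It is an illustration only; the function name
is ours, and non-negative integer cell values are assumed.

```python
def allowed_values(a, beta, gamma):
    """Options for a cell value a: every multiple of beta between
    floor(a) - gamma*beta and ceil(a) + gamma*beta."""
    lo = (a // beta) * beta
    hi = lo if lo == a else lo + beta
    values = {lo - t * beta for t in range(gamma + 1)}
    values |= {hi + t * beta for t in range(gamma + 1)}
    return sorted(values)
```

For example, with base 5 the value 17 has the two options 15 and 20 when
$\gamma = 0$, and the four options 10, 15, 20 and 25 when $\gamma = 1$. A value
that is already a multiple of the base, such as 10, is fixed when $\gamma = 0$
and may move to 5 or 15 when $\gamma = 1$.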

The existence of the new additional integer variables enlarges the space of
solutions by allowing new (additive and protected) non-zero-restricted
solutions. Since the space is larger, it is not immediate that the new solutions
can be found more simply, so it is very important to first try small values of
$\gamma$. The model can also be adapted to work with different rounding bases
$\beta_i$ for different cells (which is useful for hierarchically structured
tables where one wants small rounding bases in the lower-level cells). See
Salazar-Gonzalez ([2002b]) for further technical details.

Fig. 3. The menu of the controlled rounding tool.

3 Using the Tool 

The algorithm has been implemented in a friendly automatic tool to simplify
the application of the controlled rounding methodology to tabular data in
practice. Since most of the people working with data in statistical offices are
very familiar with Microsoft Excel, we have decided to use this software as a
graphical interface to the algorithm. In a similar way, the algorithm could be
embedded in other frameworks. More precisely, our implementation is an add-in
tool for Microsoft Excel, and therefore it is an xll file that can be easily
activated by the users. We have produced three different implementations by
considering three libraries for solving the linear programming problems. The
first one is a non-commercial linear programming library which is fully inside
our xll file, and therefore it is the simplest way of running the controlled
rounding tool. The other two implementations allow for the replacement of our
modest linear programming solver by more robust commercial solvers: Xpress by
Dash-Optimization ([2003]) and Cplex by Ilog ([2003]), respectively. By using
one of these commercial solvers, the performance of the rounder improves on
large-scale tables.

In all cases, a version of Microsoft Excel is required, together with a table to
round in a workbook (whether or not the table is split across different worksheets).
A recommended way of providing the table structure $My = b$ to the algorithm is to
use the MS-Excel style of inserting a formula in each cell. (Recall that a table is
not only a collection of numbers, but also a collection of linear equations linking
these numbers.) This MS-Excel feature allows at most one equation for each of the
marginal cells, but in some special situations a particular table can require
additional equations (for example, to guarantee that two different subsets of cells
have the same sum). Our implementation (see Figure 3) asks for the manual insertion
of new equations, and the basic mouse operations of MS-Excel can be used to simplify
this task. Further functions for removing and adding equations are also available.
At any time, the number of equations and the number of cells (involved in equations)
are displayed, so the user is informed about the size of the table to be solved by
the controlled rounding optimizer. The user can also add protection levels and
external bounds if desired. By default, the protection levels are zero, and the
lower and upper external bounds are fixed to the original value minus and plus,
respectively, two times the rounding base. These default settings can be easily
changed, either by editing each cell individually or by choosing some pre-specified
common rules. See Figure 4.

Fig. 4. The extended main dialog.
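
The default settings can be written down in a few lines. The following is only a minimal sketch: the function name and the dictionary keys are hypothetical and are not taken from the tool, and the split into three protection levels is an assumption.

```python
def default_cell_settings(value, rounding_base):
    """Hypothetical defaults for one cell: zero protection, bounds at value -/+ 2 bases."""
    return {
        "lower_bound": value - 2 * rounding_base,
        "upper_bound": value + 2 * rounding_base,
        # Protection levels default to zero (names are illustrative only).
        "lower_protection": 0,
        "upper_protection": 0,
        "sliding_protection": 0,
    }


settings = default_cell_settings(value=192, rounding_base=5)
print(settings["lower_bound"], settings["upper_bound"])
```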

Once the table input parameters have been decided, the optimization algorithm
starts looking for a protected pattern. If a solution is found within the time limit,
the tool prints on the screen the largest difference between the unrounded and
the rounded values over all the cells:

Maximum jump := $\max\{|a_i - v_i| : i \in I\}$.

The algorithm starts with $\gamma = 0$ and increases $\gamma$ gradually until a rounded
table is found or the time limit is reached. The maximum computing time allowed and
the maximum $\gamma$ value to try are input parameters that can be set by the user.
The solution is printed in another MS-Excel workbook following exactly the same
format (including colours, comments, fonts, etc.) as the input table.
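
The outer loop just described can be sketched as follows. This is an illustration only: `solve` is a hypothetical callback standing in for the optimizer (it is not an interface of the tool), and the way the time budget is shared between successive attempts is an assumption.

```python
import time


def round_with_increasing_gamma(solve, max_gamma, time_limit_s):
    """Try gamma = 0, 1, 2, ... until `solve` returns a table or a limit is hit.

    `solve(gamma, remaining_seconds)` is a hypothetical callback that returns a
    rounded table, or None if no table was found for this gamma.
    """
    start = time.monotonic()
    for gamma in range(max_gamma + 1):
        remaining = time_limit_s - (time.monotonic() - start)
        if remaining <= 0:
            break  # overall time limit reached
        table = solve(gamma, remaining)
        if table is not None:
            return gamma, table
    return None


def max_jump(original, rounded):
    """Largest absolute difference between unrounded and rounded cell values."""
    return max(abs(a - v) for a, v in zip(original, rounded))
```

Starting from the smallest value and stopping at the first success keeps the reported maximum jump as small as the search allows.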

Even though the algorithm implicitly guarantees the protection level requirements,
the user can request the so-called auditing phase, which consists of computing the
maximum and minimum possible original value for every cell according to the rounded
table. Since this implies solving many optimization problems (two for each cell), it
is only recommended to activate it when working on a small table. The output is
another MS-Excel workbook containing all the protected ranges (see Figure 5).

Fig. 5. Original table, controlled rounding table and disclosure intervals ($\beta = 10$).

The appendix describes other possibilities for loading and saving tables in the
tool, and reports the results of some preliminary experiments.

Appendix

Loading and Saving Tables within the Tool 

As mentioned above, the recommended format for providing tables as input to
the automatic controlled-rounding procedure is the standard MS-Excel
worksheet. The internal cells defining the table must contain the unrounded
values $a_i$, and the marginal cells should be computed by using MS-Excel
formulas. Notice that, for example, when entering table (a) in Figure 1, only 7
of the 8 equations over the four rows and the four columns are linearly
independent, so it is possible to assign one to each marginal cell. Clearly, the
rounded table will also satisfy all 8 equations, since the removed one is a linear
combination of the others.

Because the location of each cell in the MS-Excel workbook is implicitly given
by its worksheet number, row number and column number, it is very easy to
display the rounded table (and the protected intervals) in a similar way in a new
workbook. One can then work (save, copy, etc.) with these workbooks in the
standard MS-Excel way.

Additionally, the tool gives the option of saving a table in two special
formats (see option “Save instance...” in Figure 3). The first file format is known
as jj-format and is currently used in τ-ARGUS, a very sophisticated program for
producing and protecting tables from microdata in statistical offices (see Hundepool
([2002])). This software is becoming very relevant in the area of Statistical Data
Protection and has the option of saving statistical tables in this format.
Basically, it consists of an ASCII file with two sets of lines: the first set provides
the input parameters of each cell (cell value, status, weight, external bounds,
protection levels, etc.) and the second set defines the linear equations linking
the cells. The second file format is named crp-format; it is similar to the previous
one, but each cell is also provided with 3 additional numbers, representing
coordinates in a 3-dimensional space used to display the table in Excel. For each
cell, the first coordinate is the number of the worksheet, the second is the row in
that worksheet, and the third is the column. With this extended format the tool can
save and load tables that are displayed on the computer screen in the MS-Excel
worksheet style. Without such coordinates for each cell (as in the jj-format) the
cells can only be loaded into computer memory but not displayed; the associated
controlled rounding problem can then be solved and the solution saved in jj-format,
but not displayed in an MS-Excel workbook. See Figure 6 for an example of a
crp-format file.

-2
16
0 cell0_10_6 192.000000 0 10 6 1 s 182.000000 202.000000 0.000000 0.000000 0.000000
1 cell0_10_5 44.000000 0 10 5 1 s 34.000000 54.000000 0.000000 0.000000 0.000000
2 cell0_10_4 100.000000 0 10 4 1 s 90.000000 110.000000 0.000000 0.000000 0.000000
3 cell0_10_3 48.000000 0 10 3 1 s 38.000000 58.000000 0.000000 0.000000 0.000000
4 cell0_9_6 62.000000 0 9 6 1 s 52.000000 72.000000 0.000000 0.000000 0.000000
5 cell0_9_5 12.000000 0 9 5 1 s 2.000000 22.000000 0.000000 0.000000 0.000000
6 cell0_9_4 33.000000 0 9 4 1 s 23.000000 43.000000 0.000000 0.000000 0.000000
7 cell0_9_3 17.000000 0 9 3 1 s 7.000000 27.000000 0.000000 0.000000 0.000000
8 cell0_8_6 49.000000 0 8 6 1 s 39.000000 59.000000 0.000000 0.000000 0.000000
9 cell0_8_5 22.000000 0 8 5 1 s 12.000000 32.000000 0.000000 0.000000 0.000000
10 cell0_8_4 19.000000 0 8 4 1 s 9.000000 39.000000 0.000000 0.000000 0.000000
11 cell0_8_3 8.000000 0 8 3 1 s 8.000000 28.000000 0.000000 0.000000 0.000000
12 cell0_7_6 81.000000 0 7 6 1 s 71.000000 91.000000 0.000000 0.000000 0.000000
13 cell0_7_5 10.000000 0 7 5 1 s 0.000000 20.000000 0.000000 0.000000 0.000000
14 cell0_7_4 48.000000 0 7 4 1 s 38.000000 58.000000 0.000000 0.000000 0.000000
15 cell0_7_3 23.000000 0 7 3 1 s 13.000000 33.000000 0.000000 0.000000 0.000000
7
0 4: 12 (-1) 13 (1) 14 (1) 15 (1)
0 4: 8 (-1) 9 (1) 10 (1) 11 (1)
0 4: 4 (-1) 5 (1) 6 (1) 7 (1)
0 4: 3 (-1) 7 (1) 11 (1) 15 (1)
0 4: 2 (-1) 6 (1) 10 (1) 14 (1)
0 4: 1 (-1) 5 (1) 9 (1) 13 (1)
0 4: 0 (-1) 4 (1) 8 (1) 12 (1)

Fig. 6. A crp-format file.
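
To illustrate how the equation lines of such a file link the cells, the following sketch parses the seven equation lines of Fig. 6 and checks them against the cell values listed there. It assumes that each equation line reads "right-hand side, number of terms, then pairs of cell index and coefficient in parentheses"; that reading is inferred from the figure and is not a specification of the format. The function names are hypothetical.

```python
import re

# Cell values copied from the cell lines of Fig. 6 (list position = cell index).
values = [192, 44, 100, 48, 62, 12, 33, 17, 49, 22, 19, 8, 81, 10, 48, 23]

equation_lines = [
    "0 4: 12 (-1) 13 (1) 14 (1) 15 (1)",
    "0 4: 8 (-1) 9 (1) 10 (1) 11 (1)",
    "0 4: 4 (-1) 5 (1) 6 (1) 7 (1)",
    "0 4: 3 (-1) 7 (1) 11 (1) 15 (1)",
    "0 4: 2 (-1) 6 (1) 10 (1) 14 (1)",
    "0 4: 1 (-1) 5 (1) 9 (1) 13 (1)",
    "0 4: 0 (-1) 4 (1) 8 (1) 12 (1)",
]

TERM = re.compile(r"(\d+)\s*\(([-+]?\d+(?:\.\d+)?)\)")


def parse_equation(line):
    """Parse 'rhs n: idx (coef) ...' into (rhs, [(idx, coef), ...])."""
    head, body = line.split(":")
    rhs, n_terms = (float(tok) for tok in head.split())
    terms = [(int(idx), float(coef)) for idx, coef in TERM.findall(body)]
    if len(terms) != int(n_terms):
        raise ValueError(f"expected {int(n_terms)} terms, found {len(terms)}")
    return rhs, terms


def residual(line, cell_values):
    """Left-hand side minus right-hand side; zero when the equation holds."""
    rhs, terms = parse_equation(line)
    return sum(coef * cell_values[idx] for idx, coef in terms) - rhs


residuals = [residual(line, values) for line in equation_lines]
print(residuals)
```

A rounded table read from the same file could be checked for additivity in the same way, by passing the rounded values instead of the original ones.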



Preliminary Computational Results 

The tool has been tested on a very large family of artificial tables distributed
across different MS-Excel worksheets and with different features. It has found an
optimal zero-restricted controlled-rounding table whenever one exists. We have also
tested the extended model ($\gamma > 0$) on the infeasible 3-dimensional table pointed
out in Causey, Cox and Ernst ([1985]), and a non-zero-restricted solution was
found. We used a personal computer with a 2400 MHz Pentium IV, and in all
cases (including 2-dimensional tables with 5000 x 25 cells) the time to find
the control-rounded table was less than one minute.

We have also conducted experiments based on real-life problems. The first
dataset was provided by David Brown (ONS, London), and was a three-dimensional
table taken from already published data. In particular, it consists of
six 2-dimensional tables (each representing a local authority district) plus the
total table (representing a county). The table was very simple and therefore all
the equations in the MS-Excel worksheet were entered manually. An optimal
table was found by the tool in less than a second.

A second dataset consisted of 431,613 cells and 105,960 equations derived
from the 2001 Scottish Census of Population. The raw data had already been protected
by other pretabulation techniques, but it is still an interesting benchmark
instance because of the equations linking the cells. More precisely, it is a frequency
table grouped by age (20 categories), sex (2 categories), living arrangements (7
categories) and wards (1176 categories) which, in turn, are grouped by local
authorities, all adding to one country. The whole hierarchical structure creates
a very complicated mathematical problem for the control-rounding optimizer,
and indeed the optimal solution of the full table was not found within two days
of computation. Still, by relaxing the 9240 equations linking the 1176 wards to
their local authorities and the local authorities to the country, the resulting
optimization problem with 431,613 cells and 96,720 equations was solved in less than
30 seconds by the controlled rounding algorithm. Clearly the solution obtained,
even if zero-restricted, is not additive for the full 105,960 equations. Still, users
can find these types of approximate solutions very useful in practice. Future
work will try to find solutions for such difficult instances by automatically
detecting minimum sets of complicating equations to be first relaxed and then
incorporated in the objective function to minimize the damage. Indeed, these
types of challenging instances are stimulating new and interesting research into
the design of an algorithm for finding almost-additive zero-restricted solutions
in a short computational time.

A third real-world dataset was obtained under a confidentiality agreement from the
“Department of Work and Pensions” (ONS, London). It consists of a collection
of subtables, one for each geographical region: 10524 wards, 408 local
authorities, 55 counties, 11 government office regions, 3 countries, 1 kingdom. Each
individual subtable is a linked table from a 4-dimensional table containing 20
grouped categories. The whole table consists of 220,040 cells and 75,572 links,
and the internal structure $My = b$ proved to be very simple for the controlled
rounding algorithm. Indeed, an optimal zero-restricted controlled-rounding table
was found in less than 30 seconds.

We hope that the current article will stimulate users in statistical offices
to run the tool for applying controlled-rounding methodologies, thus providing
further benchmark results that will help improve the code in future releases.

Abstract. This paper describes computational experiments with an algorithm
for control-rounding any series of linked tables such as typically occur in offi- 
cial statistics, for the purpose of confidentiality protection of the individual con- 
tributors to the tables. The resulting tables consist only of multiples of the 
specified rounding base, are additive, and have specified levels of confidential- 
ity protection. Computational experiments are presented demonstrating the con- 
siderable power of the program for control-rounding very large tables or series 
of linked tables. Heuristic approaches to problematic cases are presented, as are 
procedures for specifying the input to the program. The statistical properties of 
the rounding perturbations are described, and a method of overcoming statisti- 
cal bias in the rounding algorithm is demonstrated. 



1 Introduction 

Rounding techniques involve the replacement of the original data by multiples of a 
given rounding base. Rounding (e.g. to the nearest integer) has been used in science 
for presentational purposes for many centuries. Random and deterministic rounding
have been used as confidentiality protection tools by national statistics institutes in
tabular data for decades (see e.g. Nargundkar and Saveglund [11], Ryan [14], Willenborg
and de Waal [15]). Unfortunately, naive rounding frequently destroys additivity
in tables. Controlled rounding has the desirable feature that the rounded tables are 
additive i.e. the values in the marginal cells coincide with those calculated by adding 
the relevant interior cells. As a general numerical technique, controlled (or matrix) 
rounding is not new: solutions were provided by Bacharach [1] as early as 1966. Con- 
trolled rounding was, however, developed and promoted as a serious technique for 
official statistics by Cox and coworkers in the 1980s (Cox and Ernst [3], Causey et al 
[2], Cox [4]). Further work on computational aspects was done by other workers 
(e.g. Kelly, Golden, Assad and Baker [8,9,10]). 

One difficulty discovered by Causey et al [2] was the existence of three-dimensional
tables for which classical (zero-restricted) controlled rounding was impossible.
In zero-restricted controlled rounding, cell entries that are already a multiple of the
rounding base are not changed, and other entries can only move to an adjacent
multiple of the rounding base. Fischetti and Salazar-Gonzalez [6] overcame this problem
for three- and four-dimensional tables by slightly relaxing the zero-restriction.
However, until very recently (Salazar-Gonzalez [12,13]), controlled rounding has not been
practical as a standard technique for confidentiality protection of tables in official
statistics, primarily because of the difficulty of finding a control-rounded version of
any set of linked tables that occur in official statistics. This paper describes initial
experiments with a controlled-rounding tool that, in theory at least, can provide
controlled rounding for any such table or set of tables. Some problems and possible
solutions are also discussed.

Section 2 starts with an introduction to the specification of frequency tables re- 
quired to apply the method: frequencies in cells and equations specifying the links 
between internal and marginal cells. The general principles of the method follow, 
particularly emphasising the form of the algorithm, its computational implementation, 
and the specification of protection levels, external bounds and other aspects. Some 
numerical experiments demonstrating the great power of the method for dealing with 
large collections of linked tables in official statistics are described in section 3. 
Methods of overcoming implementation difficulties that occurred in some early ex- 
periments are discussed in section 4, and other means of steering the technology to 
provide results useful in official statistics are described in section 5. The paper ends 
with some general conclusions and recommendations. 



Table 1. The number of income support claimants in area X from administrative data.

               Female   Male   Total
Under 20            0      3       3
20-39              10     12      22
40-59               5     10      15
60 and over        32     12      44
Total              47     37      84



2 Methodology 

We summarise in this section the basic ideas underlying the algorithm for carrying out 
controlled rounding of a frequency table. See Salazar-Gonzalez [13] for a more de- 
tailed description of the optimization problem underlying the methodology. 



2.1 Preliminaries - Specifying a Frequency Table

A frequency table is a collection of frequencies (i.e. non-negative integers), and a
specification of the structure (a set of mathematical equations linking the frequencies,
e.g. a marginal column frequency equals the sum of the frequencies in the column).
Table 1 is an example similar to many found on the UK Neighbourhood Statistics
web site (http://www.neighbourhood.statistics.gov.uk). Thus Table 1 can be represented as
a collection of non-negative integers, $a = [a_i : i \in I]$ where $a_1 = 0$, $a_2 = 3$,
$a_3 = 3$, $a_4 = 10$, $a_5 = 12$, $a_6 = 22$, ..., $a_{12} = 44$, $a_{13} = 47$,
$a_{14} = 37$, $a_{15} = 84$, and a structure $Ma = 0$, i.e.

$$
\begin{aligned}
a_1 + a_2 - a_3 &= 0 \\
a_4 + a_5 - a_6 &= 0 \\
a_7 + a_8 - a_9 &= 0 \\
a_{10} + a_{11} - a_{12} &= 0 \\
a_{13} + a_{14} - a_{15} &= 0 \\
a_1 + a_4 + a_7 + a_{10} - a_{13} &= 0 \\
a_2 + a_5 + a_8 + a_{11} - a_{14} &= 0 \\
a_3 + a_6 + a_9 + a_{12} - a_{15} &= 0
\end{aligned}
\qquad (1)
$$

It can be seen that the first of these equalities (i.e. 0 + 3 - 3 = 0) merely states that
the 'male' and 'female' frequencies for age 'under 20' add to the marginal total
frequency for age 'under 20'.
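
As a small illustration of this representation, the sketch below stores Table 1 as frequencies indexed 1 to 15 (row by row, as in the text) and writes each structural equation as a tuple of cell indices whose last entry is the marginal cell. The tuple encoding is a choice made for this example, not a format used by the software.

```python
# Frequencies of Table 1, numbered row by row: Female, Male, Total per age band.
a = {
    1: 0, 2: 3, 3: 3,
    4: 10, 5: 12, 6: 22,
    7: 5, 8: 10, 9: 15,
    10: 32, 11: 12, 12: 44,
    13: 47, 14: 37, 15: 84,
}

# Each tuple lists the summand cells followed by the marginal cell they add to.
equations = [
    (1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, 15),  # row sums
    (1, 4, 7, 10, 13), (2, 5, 8, 11, 14), (3, 6, 9, 12, 15),      # column sums
]


def residual(equation, frequencies):
    """Sum of the summand cells minus the marginal cell; zero if additive."""
    *summands, margin = equation
    return sum(frequencies[i] for i in summands) - frequencies[margin]


residuals = [residual(eq, a) for eq in equations]
print(residuals)
```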



2.2 Principles of the Method 

The process of controlled rounding involves finding a table, $y$, where all the elements,
$y_i$, $i \in I$, are integer multiples of the specified rounding base, $\beta$, and also
satisfy the constraints $My = 0$ satisfied by the original table. That is to say, the
interior cells of the rounded table add to the marginal cells of the rounded table, just
as in the original table. The methodology outlined in this paper involves finding such a
table that:

(a) provides adequate confidentiality protection to the original table, 

(b) is, in some sense, close to the original table, and 

(c) where the rounding process has ’good’ statistical properties. 

The problem with any perturbation process, including controlled rounding, is that 
the resulting table is distorted from the original. It is obviously important to minimise 
the distortion, but that is not the only requirement. We need to do it in such a way 
that the table resulting has adequate confidentiality protection, and can be analysed 
for various statistical purposes without too many complications. The use of the term 
’good’ in (c) means that the resulting table is useful for all required statistical analyses, 
as will be discussed later. 

Fischetti and Salazar-Gonzalez [6] describe an efficient computational scheme for 
finding an optimally rounded table for any three- and four-dimensional table. Sala- 
zar-Gonzalez [12,13] extend this methodology to any collection of linked tables. 
Novel features of this extended method are that it can deal with all types of table 
structures (including linked and hierarchical tables), and, perhaps more importantly, it 
ensures that specified confidentiality protection levels are met in the rounded table. It 
also allows restrictions on subsets of the rounded frequencies to be incorporated, e.g. 
the rounded frequencies of some cells can be constrained to take values previously 
published. We first describe the method in its simplest form (in the context of fre- 
quency tables described in section 2.1) without specified confidentiality protection or 




Getting the Best Results in Controlled Rounding with the Least Effort 61 



external bounds. How to incorporate these important features into the method is cov- 
ered in sections 2.5 and 2.6. 

Controlled rounding of the above table involves two aspects: (1) finding a feasible
table, (2) that is close to the original. Feasibility is defined by the following statement.



A feasible table is a set of integer frequencies, $y = (y_1, y_2, y_3, \ldots, y_n)$, that
satisfy the conditions:

Condition 1 (additivity): $y$ is subject to the same equality and inequality
linear constraints satisfied by the frequencies in the original table, viz. $My = 0$;

Condition 2 (rounding): $y_i \in \{\lfloor a_i \rfloor, \lceil a_i \rceil\}$, where
$\lfloor a_i \rfloor$ is the largest multiple of $\beta$ less than or equal to $a_i$ and
$\lceil a_i \rceil$ is the smallest multiple of $\beta$ greater than or equal to $a_i$.

(2)
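
The two conditions can be checked mechanically. The sketch below does so for Table 1 with rounding base 5; cells are stored row by row in a list (0-based), equations are lists of (cell index, coefficient) pairs, and all function names are hypothetical. It also shows that rounding every cell to its nearest multiple satisfies Condition 2 but breaks Condition 1.

```python
def floor_to_base(a, base):
    """Largest multiple of `base` less than or equal to a."""
    return (a // base) * base


def ceil_to_base(a, base):
    """Smallest multiple of `base` greater than or equal to a."""
    return -((-a) // base) * base


def table_equations(n_rows, n_cols):
    """Row and column sum equations for a table stored row by row with its margins."""
    lines = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
    lines += [[r * n_cols + c for r in range(n_rows)] for c in range(n_cols)]
    # Every line is: sum of all cells but the last, minus the last (marginal) cell.
    return [[(i, 1) for i in cells[:-1]] + [(cells[-1], -1)] for cells in lines]


def is_feasible(y, a, equations, base):
    """Condition 1 (additivity) and Condition 2 (adjacent multiple) together."""
    additive = all(sum(c * y[i] for i, c in eq) == 0 for eq in equations)
    adjacent = all(
        yi in (floor_to_base(ai, base), ceil_to_base(ai, base))
        for yi, ai in zip(y, a)
    )
    return additive and adjacent


a = [0, 3, 3, 10, 12, 22, 5, 10, 15, 32, 12, 44, 47, 37, 84]
equations = table_equations(5, 3)

controlled = [0, 5, 5, 10, 10, 20, 5, 10, 15, 30, 15, 45, 45, 40, 85]
nearest = [0, 5, 5, 10, 10, 20, 5, 10, 15, 30, 10, 45, 45, 35, 85]

print(is_feasible(controlled, a, equations, 5))
print(is_feasible(nearest, a, equations, 5))
```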



Then we need to select from all the feasible tables, if any, one that matches our
selection criteria. The main criterion used is closeness to the original values, $a$.
Closeness between the rounded and original values can be quantified in many ways, but in
the current application we use

$$
\delta(y,a) = \sum_{i \in I} w_i \, |y_i - a_i| \qquad (3)
$$



where $w_i$ is a weight (usually equal to 1) that allows greater importance to be
attached to certain elements of the table if required. For simplicity in our presentation
here, we assume that $w_i = 1$ for all cells. The problem of finding a solution, $y$,
satisfying the conditions given in box (2), that also minimises $\delta(y,a)$, was
mathematically modelled as a 0-1 integer linear program in Salazar-Gonzalez [12,13]. The
idea consists of rewriting the objective function in terms of binary variables $x_i$
taking a value 0 when the frequency for cell $i$ is rounded down (i.e.
$y_i = \lfloor a_i \rfloor$), and 1 when it is rounded up ($y_i = \lceil a_i \rceil$):

$$
\delta(y,a) = C + \sum_{i \in I} q_i x_i \qquad (4)
$$

where $C$ is a constant dependent only on $a$, and
$q_i = \lfloor a_i \rfloor + \lceil a_i \rceil - 2a_i$ is the extra cost
associated with rounding up rather than down for cell $i$.
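
The rewriting of the distance as a constant plus a linear term in the binary variables can be verified numerically. The sketch below assumes unit weights and takes the constant to be the total cost of rounding every cell down, which is one natural choice consistent with the definition of the extra cost of rounding up; the rounded table used is one feasible rounding of Table 1 to base 5.

```python
def floor_to_base(a, base):
    return (a // base) * base


def ceil_to_base(a, base):
    return -((-a) // base) * base


def distance(y, a):
    """Sum of absolute differences between rounded and original values."""
    return sum(abs(yi - ai) for yi, ai in zip(y, a))


def decomposition(a, base):
    """Return (C, q): cost of rounding everything down, and extra cost of rounding up."""
    C = sum(ai - floor_to_base(ai, base) for ai in a)
    q = [floor_to_base(ai, base) + ceil_to_base(ai, base) - 2 * ai for ai in a]
    return C, q


a = [0, 3, 3, 10, 12, 22, 5, 10, 15, 32, 12, 44, 47, 37, 84]
y = [0, 5, 5, 10, 10, 20, 5, 10, 15, 30, 15, 45, 45, 40, 85]
base = 5

C, q = decomposition(a, base)
# x_i = 1 where the cell was rounded up; cells already on a multiple get x_i = 0.
x = [1 if yi > floor_to_base(ai, base) else 0 for yi, ai in zip(y, a)]
linear_form = C + sum(qi * xi for qi, xi in zip(q, x))

print(distance(y, a), linear_form)
```

Cells that are already multiples of the base have zero extra cost, so their binary variable does not affect the objective.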



2.3 Computational Implementation 

We have used the computer program implemented in Salazar-Gonzalez [13]. It is a
branch-and-bound algorithm, a typical scheme for finding an optimum solution to any
integer linear programming problem (Wolsey [16]). The branching phase consists of
fixing $y_i$ either to round down or up to the nearest multiple of $\beta$ (or
equivalently, exploring the two subproblems in which $x_i = 0$ or $x_i = 1$). To avoid
having to evaluate the consequences of doing this for all cells, there is a bounding
phase of the algorithm. This consists of solving the linear programming relaxation of
the mathematical problem (i.e. a non-integer simplification of the original integer
program, thus providing a lower bound for the objective function (4)) after each
branching step. A further ingredient to speed up the process is a heuristic phase that
provides an upper bound of the objective function at each branching step. This enables
unproductive branches to be discarded (when the lower bound is greater than or equal to
the upper bound), thus saving computing time. It also provides a lower and upper bound
on the objective function at each iteration, thus allowing an assessment of the
properties of the best solution so far, which is sometimes very useful if the iterative
search is taking a very long time and the process is aborted.
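
To make the branching and bounding steps concrete, here is a deliberately simplified sketch for Table 1 with base 5. It is not the program described above: the linear programming relaxation is replaced by a much cruder lower bound (each undecided cell is charged its cheapest rounding, ignoring additivity), there is no heuristic phase, and additivity is only checked once all cells are decided. All names are hypothetical.

```python
def table_equations(n_rows, n_cols):
    """Row and column sum equations for a table stored row by row with its margins."""
    lines = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
    lines += [[r * n_cols + c for r in range(n_rows)] for c in range(n_cols)]
    return [[(i, 1) for i in cells[:-1]] + [(cells[-1], -1)] for cells in lines]


def branch_and_bound(a, equations, base):
    lo = [(v // base) * base for v in a]       # round-down value of each cell
    hi = [-((-v) // base) * base for v in a]   # round-up value of each cell
    free = [i for i in range(len(a)) if lo[i] != hi[i]]  # multiples stay fixed
    # tail[k]: lower bound on the cost of free cells k, k+1, ... (crude relaxation).
    tail = [0] * (len(free) + 1)
    for k in range(len(free) - 1, -1, -1):
        i = free[k]
        tail[k] = tail[k + 1] + min(a[i] - lo[i], hi[i] - a[i])

    best = {"cost": float("inf"), "y": None}
    y = list(lo)

    def search(k, cost):
        if cost + tail[k] >= best["cost"]:
            return  # bounding: this branch cannot improve on the incumbent
        if k == len(free):
            if all(sum(c * y[i] for i, c in eq) == 0 for eq in equations):
                best["cost"], best["y"] = cost, list(y)
            return
        i = free[k]
        # Branching: explore the cheaper rounding direction first.
        for value in sorted((lo[i], hi[i]), key=lambda v: abs(v - a[i])):
            y[i] = value
            search(k + 1, cost + abs(value - a[i]))
        y[i] = lo[i]

    search(0, 0)
    return best["cost"], best["y"]


a = [0, 3, 3, 10, 12, 22, 5, 10, 15, 32, 12, 44, 47, 37, 84]
best_cost, best_table = branch_and_bound(a, table_equations(5, 3), base=5)
print(best_cost, best_table)
```

The cruder bound keeps the example short and free of external solvers, at the price of pruning far fewer branches than a linear programming relaxation would on tables of realistic size.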

Another important ingredient of an efficient branch-and-bound algorithm is a pre- 
processing stage, in which the size of the program is reduced by removing redundant 
variables. For example, in this case, when frequencies are already multiples of the 
rounding base, they can be removed from the mathematical model. Redundant equa- 
tions can also be removed. In the example given in section 2.1, the 8 equations defin- 
ing the structure can be reduced to 7 linearly independent equations. 
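
The redundancy among the structural equations can be confirmed by computing the rank of the equation matrix. The sketch below builds the 8 row and column equations of Table 1 (5 rows by 3 columns including margins) and computes the rank with exact rational arithmetic; the helper names are hypothetical and the elimination routine is a generic one, not the preprocessing used in the program.

```python
from fractions import Fraction


def dense_equations(n_rows, n_cols):
    """One matrix row per row/column sum of a table stored row by row with margins."""
    n = n_rows * n_cols
    lines = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
    lines += [[r * n_cols + c for r in range(n_rows)] for c in range(n_cols)]
    matrix = []
    for cells in lines:
        row = [0] * n
        for i in cells[:-1]:
            row[i] = 1
        row[cells[-1]] = -1  # the last cell of each line is its margin
        matrix.append(row)
    return matrix


def rank(matrix):
    """Rank by Gauss-Jordan elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(len(rows)):
            if r != rk and rows[r][col] != 0:
                factor = rows[r][col] / rows[rk][col]
                rows[r] = [x - factor * p for x, p in zip(rows[r], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


M = dense_equations(5, 3)
print(len(M), rank(M))
```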

The above algorithm was implemented in the standard C programming language.
Computing the lower bound of the objective function (4) required the use of a
professional linear programming solver, and for this purpose we used Xpress 14 (Dash
Optimization [5]). Specifying the structure $My = 0$ (and the frequencies, $a$, for any
set of linked tables, and other input parameters) in an appropriate form for transfer to
the controlled rounding routines is not trivial in practice. The complete algorithm has
been incorporated within τ-Argus (Hundepool [7]), the general purpose disclosure
control package for statistical tables (available from http://neon.vb.cbs.nl/casc/).
τ-Argus has been developed to produce aggregate tables from input microdata, and
store the necessary metadata to guide various disclosure control routines, including
that for controlled rounding.

2.4 Zero-Restricted or Not 

The definition of feasibility specified in box (2) (more precisely condition 2 of this 
definition) is for the classical zero-restricted version of the problem (Cox [4]), where 
(a) frequencies that are already multiples of the rounding base are not changed in the 
rounding process; and (b) for other frequencies, there are only two choices: rounding 
up or rounding down to an immediately adjacent multiple of the rounding base. For 
some tables, no zero-restricted solution exists (Causey et al [2]), and the method then 
requires generalization to allow rounding to a non-adjacent multiple of the rounding 
base. Various generalizations are possible. Perhaps the simplest is to allow those 
frequencies that are already multiples of the rounding base to move to an adjacent 
multiple. This minor modification allows solutions to be found for many of the non- 
feasible tables described in the literature. (We have also not been able to devise a 
simple artificial table that is non-feasible using this extension.) However, this modifi- 
cation is unlikely to be sufficient for some more complex tables. Other generalizations 
involve allowing movements to a non-adjacent multiple of the rounding base.
Salazar-Gonzalez [13] uses a parameter $\gamma$ to describe the freedom to be allowed in
moving to nearby multiples of the rounding base: jumps to multiples up to (but not
including) $\gamma + 1$ rounding bases away are allowed. Setting $\gamma = 0$ specifies the
zero-restricted controlled rounding defined above, where the multiples are fixed, and
other frequencies have just two possible rounded values. With $\gamma = 1$, a multiple of
the rounding base has three options, whereas other frequencies have four, viz.
$y_i \in \{\lfloor a_i \rfloor - \beta, \lfloor a_i \rfloor, \lceil a_i \rceil, \lceil a_i \rceil + \beta\}$,
provided these are non-negative.
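
One way to read this rule is that a cell may take any non-negative multiple of the rounding base whose distance from the original frequency is strictly less than the parameter plus one, times the base. The sketch below enumerates the candidates under that reading; the function name is hypothetical and the reading itself is an interpretation of the text, checked only against the counts quoted above.

```python
def allowed_values(a, base, gamma):
    """Non-negative multiples of `base` strictly fewer than gamma + 1 bases away from a."""
    reach = (gamma + 1) * base
    lowest_multiple = (a // base) * base
    candidates = range(lowest_multiple - reach, lowest_multiple + reach + base + 1, base)
    return [v for v in candidates if v >= 0 and abs(v - a) < reach]


for frequency in (10, 12, 3):
    print(frequency, allowed_values(frequency, 5, 0), allowed_values(frequency, 5, 1))
```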

2.5 Specifying Protection

A substantial advance in the current methodology for controlled rounding is that
required levels of confidentiality protection can be specified in advance for each cell of
the table, and the algorithm will ensure that these levels are achieved in the rounding
procedure (see Salazar-Gonzalez [13]). Typically, only a subset of cells will be
specified as requiring protection, termed unsafe cells (see section 2.6). Upper, lower and
sliding protection levels, $UPL_k$, $LPL_k$, $SPL_k$, can be specified as input parameters
to the algorithm for the $k$th unsafe cell. This will ensure a sufficient margin of
uncertainty around each such cell. For any cell in the rounded table, an intruder can
compute a maximum and minimum of the possible range of original frequencies for the $k$th
cell, $\overline{y}_k$ and $\underline{y}_k$ respectively, using the information in the
output table and the information provided about the rounding process, ($\beta$, $\gamma$).
The intruder obtains the upper limit, $\overline{y}_k$, for a given frequency, $a_k$, by
solving the following integer programming problem:

$$
\overline{y}_k = \max z_k \qquad (5)
$$

subject to

$$
\begin{aligned}
& Mz = 0 \\
& z \ge 0 \\
& y_i - (\gamma+1)\beta + 1 \le z_i \le y_i + (\gamma+1)\beta - 1 \quad \text{for all } i \in I \\
& z \text{ is integer,}
\end{aligned}
$$

and by solving a similarly specified problem to obtain the lower limit, $\underline{y}_k$
(by replacing max by min in (5)). To give an example, for the zero-restricted case
($\gamma = 0$), a typical interval is $[y_k - \beta + 1, y_k + \beta - 1]$, provided
$y_k > 0$. If $y_k = 0$, the interval is $[0, \beta - 1]$. When the intruder has extra
information about the original values (see discussion of external bounds in section 2.6),
then narrower intervals than these can be obtained. Notice also that the original value,
$a_k$, is always within the interval $[\underline{y}_k, \overline{y}_k]$ computed
by the attacker.

These problems are typically termed attacker problems. Solving the attacker prob- 
lems for all the unsafe cells in the table is termed the auditing phase of the controlled 
rounding procedure. 
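
For a single cell considered in isolation, the interval an intruder can deduce from the rounding parameters alone is easy to write down. The sketch below computes only that per-cell interval: it ignores the additivity constraints linking the cells, so it is an outer bound on the interval obtained from the full attacker problem, which may be narrower. The function name is hypothetical.

```python
def single_cell_interval(y, base, gamma=0):
    """Range of original integer frequencies compatible with rounded value y.

    Uses only the rounding rule and non-negativity, not the links between cells.
    """
    reach = (gamma + 1) * base
    return max(0, y - reach + 1), y + reach - 1


print(single_cell_interval(15, 5))
print(single_cell_interval(0, 5))
print(single_cell_interval(15, 5, gamma=1))
```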

A feasible solution, $y$, is protected if the following inequalities hold:

$$
\overline{y}_k \ge a_k + UPL_k, \qquad
\underline{y}_k \le a_k - LPL_k, \qquad
\overline{y}_k - \underline{y}_k \ge SPL_k \qquad (6)
$$

for all unsafe cells $k$. The term protected table also implies a feasible table as
defined in box (2). See Salazar-Gonzalez [13] for further technical details about
combining these different elements: incorporating constraints (6) (whose terms are
defined in model (5)) into the program minimizing function (4) subject to conditions (2).
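
Once the attacker's interval for an unsafe cell is known, checking the protection requirement is a matter of three comparisons: the upper limit must reach the original value plus the upper protection level, the lower limit must reach down to the original value minus the lower protection level, and the width must be at least the sliding protection level. A minimal sketch, with a hypothetical function name and the interval passed in as a pair:

```python
def is_protected(a_k, interval, upl=0, lpl=0, spl=0):
    """Check the upper, lower and sliding protection levels for one unsafe cell."""
    y_min, y_max = interval
    return (
        y_max >= a_k + upl          # upper protection level
        and y_min <= a_k - lpl      # lower protection level
        and y_max - y_min >= spl    # sliding protection level
    )


# Original frequency 3 rounded to 5 with base 5: the attacker's interval is at best [1, 9].
print(is_protected(3, (1, 9), upl=2, lpl=2, spl=5))
print(is_protected(3, (1, 9), upl=7))
```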

2.6 Specifying Cell Status and Bounds on Rounded Cell Frequencies

The mathematical formulation allows cells to be specified as 

• fixed, i.e. constrained to take a given value, 

• safe i.e. do not require any protection, or 

• unsafe, i.e. they require protection as specified in the protection levels read in for 
those cells. 

A cell is required to be fixed to a value when, for example, it has been released
already in a previous publication. In other words, the formulation makes it possible to
ensure that the values in the new release are consistent with those in an old release.
One important aspect of fixed cells is that they are effectively removed from the
problem, i.e. they do not have an associated binary variable $x_i$ in the model, thus
reducing the search space.

Alternatively, it might be widely known that a particular frequency lies within
certain limits, but is not fixed to a specific value, in which case it is important to
allow for this when ensuring that the rounded table provides the required protection.
The resulting limits $lb_i$ and $ub_i$ (also termed external bounds) are simply inserted
into the mathematical formulation by replacing the non-negativity constraint,
$z_i \ge 0$, in the attacker problem (5) by $lb_i \le z_i \le ub_i$ for each cell $i$.



3 Performance of the Algorithm - Computational Experiments 

We here illustrate the performance of the algorithm by summarising some computa- 
tional experiments described in Salazar-Gonzalez [13] on real datasets. 



3.1 1991 UK Census Table Extract 

This pilot example is a small extract of already published data for one region of the
UK. Information was provided for 6 local authority districts (LADs) within one
(artificial) county. For each geographical area, there were effectively two spanning
variables, V1 with 9 levels and V2 with 3 levels; the two-way table plus all margins were
included. The set of tables for the 6 districts and the county totals were rounded
together. The overall table (including all 6 districts and the county total) is then a
9 x 3 x 6 table, plus all 1- and 2-way marginal tables. This pilot example was rounded
virtually instantaneously in zero-restricted form by the software.



3.2 2001 Scottish Census Table 

The second example is a table that appeared in the 2001 Scottish Census of Population.
For these Scottish data, various pre-tabulation disclosure control techniques were
considered sufficient to protect the confidentiality of the respondents, so there is no
need to apply a further stage of rounding for confidentiality protection. Nevertheless,
they form an interesting example of data (a) that are close in form to unprotected data,
and (b) that are also in the public domain, and so can be used to test our software.
The table we used was of age (20 levels) by sex (2 levels) by living arrangements (7
levels), together with a number of marginal tables. Considering geography, there were
32 local authorities (LADs), within which were 1176 wards. The total number of
cells was 431,613, and the number of equations was 105,960. Each geographical
area has 80 equations, and there are 1209 geographical areas (1176 wards + 32 LADs
+ 1 country). The full problem in its zero-restricted form was difficult to solve with
$\beta = 5$: no solution was found within two days. The optimisation problems were not
solved even when $\gamma$ was increased to 5. However, if the 9240 equations representing
relationships between geographical areas were removed, separate zero-restricted
solutions for all geographical areas and levels were found in 17 seconds on a
2533 MHz Pentium computer. Once the control-rounded tables for the wards were found,
these could be added to obtain control-rounded tables for LADs and for the whole of
Scotland that are consistent with the control-rounded ward tables. Heuristic solutions of
this type are discussed in more generality in section 4.2.
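
The aggregation step can be sketched in a few lines. The example below uses three invented ward tables (Female, Male, Total) rounded to base 5; the data and names are purely illustrative. Summing the rounded ward tables gives a table for the higher level that is additive and consists of multiples of the base, but it need not be zero-restricted with respect to the original higher-level frequencies.

```python
def aggregate(rounded_children):
    """Cell-wise sum of the rounded child tables (one list of cells per child)."""
    return [sum(cells) for cells in zip(*rounded_children)]


# Invented ward tables (Female, Male, Total), originals and zero-restricted roundings.
ward_originals = [[2, 0, 2], [2, 0, 2], [2, 0, 2]]
ward_rounded = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]

parent_original = aggregate(ward_originals)
parent_rounded = aggregate(ward_rounded)

print(parent_original, parent_rounded)
```

Here the higher-level frequency of 6 would have to become 5 or 10 under zero-restricted rounding, whereas the aggregated value is 0, which illustrates the kind of compromise this heuristic accepts.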



3.3 UK Benefits Data 

We also attempted controlled rounding to base 5 on a set of tables on benefit
claimants for 10,524 wards and higher geographical levels covering the whole of the UK.
For each area in the hierarchy there were 4 one-way tables (age, gender, family type,
household group type, with 6, 2, 2, and 5 levels respectively), one 2-way table
(collapsed age with 2 levels x family type), and an overall total. There were 220,040
cells and 75,572 equations. An optimally rounded table was found in 21 seconds on a
2533 MHz Pentium personal computer.



4 Heuristic Approaches to Finding a Protected Table 

Sometimes protected tables are difficult to obtain with the approach outlined in sec- 
tion 2. In this section, we present some alternative approaches generating acceptable 
protected tables for such cases. 



4.1 Increasing y in Order to Find a Protected Table 

For some tables, finding any feasible solution is difficult or impossible. If the current 
search is a zero-restricted one (y=0), then we can try to find a solution without this 
constraint i.e. by trying successively y =1,2,3 etc. However, by widening the solution 
space in this way, we are also widening the space to be searched, and this might in 
turn cause new computational problems. The number of variables (and corresponding 
branches in the branch-and-bound scheme) increases substantially with y. This sug- 
gests that y should be kept small, if at all possible. It should be noted that y is also 
involved in the specification of the protection (see attacker problem (5) above). 

The parameter y need not be the same for all cells in the table. If it is necessary to
increase y, the computational problems are likely to be fewer if y is increased for
selected cells only: those where the problems are thought to lie, and where the increase
in y is likely to do the least damage to the published table. For example, y could be
increased for the higher levels in a hierarchical spanning variable, or for those cells
with the highest frequencies.



4.2 Hierarchical Classifications 

Some tables have spanning variables with a hierarchical structure, such as hierarchical 
area classifications. To take an example from the UK, consider the data in section 3.3: 
there are 10,524 wards within 408 local authority districts, within 11 government 
office regions. For the data in section 3.3, the whole problem was solved automati- 
cally using the inbuilt features of the algorithm. In other cases, this might not be so. 
If a protected table cannot be found for the whole set, then a protected rounded solu- 
tion can be found separately for each table at a lower level in the hierarchy. Then, 
following a bottom-up scheme, these individual protected rounded tables can be
aggregated to obtain the tables at the higher levels of the hierarchy. For example in the
UK case, we might choose to produce separately rounded tables for each of the 408
local authority districts, and then add these figures to get the tables for the 11
government office regions. An example of the use of this technique is mentioned in
section 3.2.

It should be noted that for the higher levels obtained by aggregation, y will usually
be higher than for the lower levels. To take a specific example, if zero-restricted
tables are found for the lowest level, tables at the higher level will not, in general,
be zero-restricted. This issue is unlikely to be serious, since the larger perturbations
occurring near the top of the hierarchy will usually be in cells with higher frequencies,
and are therefore more likely to be acceptable.
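
The effect of bottom-up aggregation can be sketched in a few lines of Python. The ward
counts, their rounded values and the helper name below are invented for illustration and
are not taken from the data sets of section 3; the sketch only shows that a sum of
base-5 multiples is still a multiple of 5, while the perturbation of the aggregate can
exceed what a zero-restricted rounding would allow.

```python
BASE = 5  # rounding base

# Hypothetical counts for the same cell in four wards of one district,
# with one possible zero-restricted rounding of each ward.
ward_original = [3, 3, 3, 3]
ward_rounded = [5, 5, 5, 5]


def is_zero_restricted(a, y, base=BASE):
    """True if y is a multiple of base lying strictly within base of a."""
    return y % base == 0 and abs(y - a) < base


# Bottom-up scheme: the district table is the sum of the ward tables.
district_original = sum(ward_original)
district_rounded = sum(ward_rounded)

print(all(is_zero_restricted(a, y) for a, y in zip(ward_original, ward_rounded)))
print(district_rounded % BASE == 0)
print(is_zero_restricted(district_original, district_rounded))
```

Here every ward value is zero-restricted, and the district value is still a multiple of
the base, but it differs from the original district count by more than one base step.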



4.3 Problematic n-Way Tables 

The technique in section 4.2 could be used also to provide a protected table for a
problematic n-way table, even in cases when there is no hierarchical structure. By
finding protected tables for each of the (n-1)-way tables obtained by constraining the
nth variable to take each of its values in turn, and then adding these to obtain the
marginal table summed over the nth variable, a protected table is found for the n-way
table. If any of the (n-1)-way tables are intractable, the same procedure can be
applied recursively. This procedure guarantees finding a solution for any n-way table.



4.4 Other Linked Tables 

A common type of linkage occurs through the occurrence of common margins in a 
number of tables. 

Protected tables can be obtained for such series by adopting a sequential approach.
Break the complete set of linked tables into a sequence of subsets, where adjacent
subsets in the sequence have margins in common. First carry out controlled rounding
on the first set. Secondly, round the second set of tables sharing margins with the first,
by fixing the margins in the second set (using fixed status, see section 2.6) that are
also found in the first set to ensure that they take the same values obtained in the first
set. Then proceed sequentially through the sets of tables until all are rounded.

A second approach involves the use of a combination of deterministic rounding of 
common margins (rounding to the nearest multiple, ensuring these take the same 
rounded values in all tables) and controlled rounding as described here for remaining 
cells, using the fixed status to fix the deterministically rounded values. The controlled 
rounding of the separate tables can be carried out separately for each such table. 

Both these approaches allow the large set of linked tables to be broken down into
smaller sets, each of which is likely to be more easily solved using the current
controlled rounding algorithm. Both these methods might fail because of infeasibility of
the controlled rounding at any stage. These problems could be solved by increasing y
at the problematic stage. For the sequential approach, another option is to return to
earlier tables in the sequence, and try alternative controlled rounding solutions.
Different solutions could be found by changing the weights, $w_i$, in the distance
function, given by expression (3), as suggested by Salazar-Gonzalez [13].



5 Steering the Algorithm in Practice 

We now discuss the problems associated with defining the input parameters required 
by the controlled rounding algorithm presented in section 2. 



5.1 Strategies for Setting Protection Levels 

In most circumstances, all cells in a control-rounded table will enjoy some
confidentiality protection. During the auditing phase, the intervals
$[\underline{y}_i, \overline{y}_i]$ denoting the protected range corresponding to each
cell $i$ are calculated, and checked to see if the upper, lower and sliding protection
levels specified are met by the control-rounded table. Typically, most frequencies will
have some protection, given by $[\underline{y}_i, \overline{y}_i]$. This interval is
frequently of width $2(y+1)\beta - 2$, and so it is not necessary to specify protection
levels explicitly.

However, in some circumstances, e.g. when tight external bounds are applied to
some frequencies or they are set to fixed values (using fixed status), it is important to
specify the protection required. In many applications it is critical to provide
protection to lower frequencies, so one strategy is to ensure that adequate protection is
provided for all the lower frequencies (less than r, say), while providing little or no
protection to higher frequencies (≥ r). One method is to set the protection levels as a
function of the unrounded frequency, $a_i$. For example,

$LPL_i = a_i, \quad UPL_i = \max(s, \beta - a_i) \quad \text{for } a_i < r,$   (7)

where r and s are fixed parameters. The parameter r could be termed the minimum
safe frequency, and the parameter s the minimum upper protection. To take a specific
example, if β=5, r=6, s=1, $LPL_i$ and $UPL_i$ are as follows:

a_i    LPL_i    UPL_i
0      0        5
1      1        4
2      2        3
3      3        2
4      4        1
5      5        1
6+     no protection
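
Expression (7) is simple enough to be written as a short function. The sketch below is
only an illustration in Python; the function name and the convention of returning None
for frequencies that receive no protection are assumptions made here, not part of the
software described in the paper.

```python
def protection_levels(a, base=5, r=6, s=1):
    """Lower and upper protection levels of expression (7).

    a    : unrounded cell frequency
    base : rounding base (beta)
    r    : minimum safe frequency
    s    : minimum upper protection
    Returns (LPL, UPL), or None when a >= r (no protection requested).
    """
    if a >= r:
        return None
    return a, max(s, base - a)


# Reproduce the worked example with beta = 5, r = 6, s = 1.
for a in range(7):
    print(a, protection_levels(a))
```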

Sliding protection levels can also be set to positive values but, in many cases, these 
are not necessary, and therefore they can be set to zero. It is also important to take 
any external bounds into account when setting protection levels. In order to ensure a
protected table is found, the bounds read in here define an interval $[lb_i, ub_i]$ that
must at least contain the interval $[a_i - LPL_i, a_i + UPL_i]$, if cell i is specified
as unsafe. Setting a protection level incompatible with the external bounds will result
in a protected table not being found.



5.2 Specifying Cell Status - Structural Zeroes 

As mentioned in section 2.6, cells can be specified as fixed, unsafe and safe. Gener- 
ally cells will be specified as unsafe or safe, depending on whether special protection 
is required or not. 

In some other circumstances, cells can be fixed, for example for structural zeroes. 
Structural zeroes are those frequencies that can logically, or using common knowl- 
edge, only take the value zero; for example, the frequency of widows under the age of 
10 (in Europe). It is therefore desirable to constrain these to remain zero in any 
rounding process. If zero-restricted rounding is used, then they will remain zero
anyway, but if any higher value of y is used for the whole table or for these cells, they
might well be rounded to β or a higher multiple of β (unless they are fixed to zero).
Another common situation occurs when parts of a table have already been released,
and it is necessary to constrain the corresponding cells to take values equal to those
that have been released, provided, of course, that the already released values are
multiples of the rounding base. In both these cases, we would specify the cell status as
fixed and specify the value required to be output as $a_i$.



5.3 Setting External Bounds 

We have already discussed, in section 5.2, situations in which it is necessary to fix
some cells of the rounded table. A further facility to read in external bounds, $lb_i$,
$ub_i$, is provided so that general knowledge that an attacker might use can be included
in the specification. This facility is not designed to directly constrain the rounded
cell frequencies, $y_i$, but mainly to be taken into account when assessing protection.
It is important that the original frequencies lie within the external bounds,
$lb_i \le a_i \le ub_i$. This ensures consistency between the intruder knowledge assumed,
and the actual table frequencies.



5.4 Avoidance of Bias

The aim of the UK Neighbourhood Statistics project is to provide detailed statistics 
for small neighbourhoods with target populations of 200-250 people. Information is 
provided for these small areas not only for their own interest, but also so that the 
neighbourhoods can be used as building blocks from which to construct an
approximation to any larger area, A. In order to produce the frequency tables for the
large area A, it is then simply necessary to add the tables for the individual
neighbourhoods contained within A. However, it is important when doing this that the
perturbations made when rounding the frequencies for each small area do not accumulate
into very large resulting perturbations for the table for area A. In order to reduce the
chances of this happening, the distortions made in random rounding are often
specified in such a way that they introduce no bias: the stochastic (i.e. random) process
is defined so that the expectation of the rounded frequency in any cell is equal to the
original frequency, $E(y_i) = a_i$, for all $a_i$ in the table. Expectation is used here
in the usual statistical sense to mean the average over a large number of realizations of
the stochastic process concerned. Note that this provides no guarantee about what happens
in any individual table. It just means that in the very long run (i.e. in any aggregate
statistic consisting of a sum of very many individual rounded tables), we are not likely
to introduce any substantial distortions in the table frequencies.
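
As an illustration of the unbiasedness condition, the following Python sketch implements
the textbook form of unbiased random rounding, in which a frequency with residue k
modulo β is rounded up with probability k/β. The function names are chosen here for
illustration only; this is not the controlled rounding algorithm of section 2.

```python
import random


def round_up_probability(a, base=5):
    """Probability of rounding a up to the next multiple of base."""
    return (a % base) / base


def random_round(a, base=5, rng=random):
    """Unbiased random rounding of the non-negative integer a."""
    lower = a - a % base
    if rng.random() < round_up_probability(a, base):
        return lower + base
    return lower


def expected_rounded(a, base=5):
    """Exact expectation of random_round(a, base)."""
    p = round_up_probability(a, base)
    lower = a - a % base
    return (1 - p) * lower + p * (lower + base)


# The expectation equals the original frequency for every a.
for a in range(21):
    print(a, round_up_probability(a), expected_rounded(a))
```

The probabilities printed for residues 1 to 4 (0.2, 0.4, 0.6, 0.8) are the values against
which any other rounding mechanism can be compared when checking for bias.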

It is possible to define bias in a systematic way for random rounding, since the pro- 
tection mechanism involves a stochastic process. It is not immediately apparent how 
to define bias for controlled rounding, since in many implementations it does not 
involve any random elements. However, just as it is possible to view the deterministic 
mechanisms used in congruential pseudo-random number generators as stochastic 
processes, we can apply a similar approximation to controlled rounding. It is a com- 
plex process whose properties we can study statistically as if it were a stochastic 
process. It is in this sense that we specify that ideally the process is approximately 
unbiased. 

We carried out a statistical analysis on the perturbations in the controlled rounding 
for each original frequency in one of our large datasets (we also got similar results for 
some other datasets), and obtained the results given in Table 2. 

It is clear that there is substantial bias in the rounding for all frequencies when
using objective function (3), but especially for original frequencies immediately
adjacent to integer multiples of the rounding base. There is a tendency for the controlled
rounding driven by the objective function (3) to round to the closest multiple of the 
rounding base much more frequently than an unbiased rounder would. In this respect, 
controlled rounding with objective function (3) behaves in a fashion intermediate 
between standard unbiased random rounding, and ordinary deterministic rounding (in 
which frequencies always round to the nearest multiple of the rounding base, e.g. 3, 4, 
6 and 7 always round to 5, etc). The bias found in this version of controlled rounding 
is unlikely to be important in almost all practical applications, since it would be 
unlikely that an undue preponderance of frequencies would fall in the areas immedi- 
ately above or below the rounding base. 

However, we considered whether changes in the method could be made to reduce 
this type of bias. Cox [4] discussed procedures for unbiased controlled rounding for
2-way tables. Salazar-Gonzalez [13] described how to model and solve the optimization
problem when the objective function (3) in the basic method described in section 2 is
replaced by

Table 2. Scottish Census tables (section 3.2). Proportion of cells rounded up to the next
multiple of the rounding base for each original frequency, for the original algorithm, and
with an alternative objective function (8). There is considerable bias in the rounding,
especially for the original frequencies adjacent to multiples of the rounding base. This is
much improved with the alternative objective function. The figures given after ± are the
half widths of 95% confidence intervals on the proportions (allowing for the sampling error
as a result of carrying out a finite number of numerical experiments). These are generally
very narrow.

Original    No. of        Expected proportion   Proportion rounding up   Proportion rounding up
frequency   occurrences   rounding up           with original            with alternative
                          if no bias            objective function (3)   objective function (8)

 0          131,831       0.0                   0.00 ± 0.000             0.00 ± 0.000
 1           33,241       0.2                   0.04 ± 0.002             0.14 ± 0.004
 2           19,535       0.4                   0.34 ± 0.007             0.41 ± 0.007
 3           13,934       0.6                   0.83 ± 0.006             0.66 ± 0.008
 4           11,406       0.8                   0.99 ± 0.002             0.85 ± 0.007
 5            9,476       0.0                   0.00 ± 0.000             0.00 ± 0.000
 6            8,561       0.2                   0.02 ± 0.003             0.13 ± 0.007
 7            7,412       0.4                   0.26 ± 0.010             0.41 ± 0.011
 8            6,820       0.6                   0.81 ± 0.010             0.66 ± 0.011
 9            6,250       0.8                   0.99 ± 0.002             0.85 ± 0.009
10            5,466       0.0                   0.00 ± 0.000             0.00 ± 0.000
11            4,906       0.2                   0.01 ± 0.003             0.12 ± 0.009
12            4,560       0.4                   0.26 ± 0.010             0.39 ± 0.014
13            4,076       0.6                   0.82 ± 0.012             0.67 ± 0.015
14            3,870       0.8                   0.99 ± 0.003             0.86 ± 0.011
15            3,660       0.0                   0.00 ± 0.000             0.00 ± 0.000
16            3,320       0.2                   0.02 ± 0.004             0.11 ± 0.011
17            3,180       0.4                   0.24 ± 0.015             0.40 ± 0.017
18            2,985       0.6                   0.83 ± 0.014             0.66 ± 0.017
19            2,913       0.8                   0.99 ± 0.003             0.87 ± 0.013
20            2,681       0.0                   0.00 ± 0.000             0.00 ± 0.000

$d(y, \xi(a)) = \sum_{i \in I} w_i \, |y_i - \xi_i|$   (8)

where $[\xi_i : i \in I]$ is a vector of random variables derived from $a$. So the
rounding procedure is now aiming to be as close as possible to $\xi$ rather than to $a$
itself, where $\xi$ is a random transformation of $a$ that takes the form of a map from
$\{n\beta+1, n\beta+2, n\beta+3, \ldots, n\beta+\beta-1\}$ onto itself for all integer
$n$. We regard the transformation from $a$ to $y$ (using objective function (3)) as a
stochastic mechanism with probabilities of rounding up given in the 4th column of
Table 2, and as we are using the same algorithm to go from $\xi$ to $y$, they apply also
to that transformation. We then obtain a matrix of transition probabilities from $a$ to
$\xi$ that would result in the required probabilities of rounding up for the overall
process. In effect what we are now doing is making two transformations, first
$a \to \xi$, then secondly $\xi \to y$, where the second transformation is carried out by
solving the controlled rounding problem using the same form of objective function as in
(3) but with $\xi$ replacing $a$, which is all that objective function (8) is. In order
to determine the appropriate transformation $a \to \xi$ it is necessary, of course, to
first determine what the bias of the standard procedure is for the particular data set
being rounded. This can be done in a preliminary stage using the standard controlled
rounding procedure. This approach is only possible when the dataset involved provides many
occurrences in which the same original frequencies are rounded, from which the statistical
properties of the controlled rounder in the specific context can be approximately
estimated. If the dataset is too small for this, then the only guide available is general
experience of controlled rounding of other datasets. It remains to be seen what
consistency in statistical properties of the rounding perturbations occurs in different
datasets, as further experience is accumulated.
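
One way to obtain such a matrix of transition probabilities is sketched below in Python.
The construction is an assumption made here for illustration and is not necessarily the
one used by the authors: each residue j is sent to one of two neighbouring residues whose
observed round-up probabilities bracket the unbiased target j/β, with mixing weights
chosen so that the overall round-up probability equals the target. The observed
probabilities are those reported for frequencies 1 to 4 under objective function (3).

```python
def transition_matrix(p_up, base):
    """Row j-1 gives the probabilities of mapping residue j to residues 1..base-1.

    p_up[k-1] is the observed probability that the standard procedure rounds a
    frequency with residue k upwards. Hypothetical two-point mixing scheme.
    """
    if len(p_up) != base - 1:
        raise ValueError("need one probability per non-zero residue")
    rows = []
    for j in range(1, base):
        target = j / base                      # unbiased round-up probability
        row = [0.0] * (base - 1)
        for k in range(base - 2):
            lo, hi = p_up[k], p_up[k + 1]
            if lo <= target <= hi and lo < hi:
                t = (target - lo) / (hi - lo)  # weight on the upper residue
                row[k], row[k + 1] = 1.0 - t, t
                break
        else:
            raise ValueError("target probability is not bracketed")
        rows.append(row)
    return rows


P_UP = [0.04, 0.34, 0.83, 0.99]   # residues 1..4, objective function (3)
T = transition_matrix(P_UP, 5)

# Overall probability of rounding up after the two-stage process.
overall = [sum(t * p for t, p in zip(row, P_UP)) for row in T]
print(overall)
```

Under these assumptions the overall round-up probabilities come out at the unbiased
values 0.2, 0.4, 0.6 and 0.8, up to floating-point error.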

The result of this exercise is that the bias is substantially reduced (see Table 2). 
For almost all practical purposes, the bias properties of the rounding procedure with 
objective function (8) are adequate. 



6 Concluding Remarks 

There are two major advantages of controlled rounding. First of all, users of the ta- 
bles find analysis easier - they do not have to cope with the inconsistencies that occur 
in conventionally rounded tables because of their non-additivity. A further advantage 
is that they are more secure: random and deterministic rounding have weaknesses that 
make it possible to unpick the rounding in some cases, precisely because the margins 
and the interior cells are rounded independently. Perhaps the biggest barrier to the
use of controlled rounding in official statistics as a confidentiality protection
technique is the lack of computational means of achieving it for standard tables with large
geographical hierarchies, other linkages between tables at each geographical level,
and sometimes the complex intrinsic form of some individual tables. The numerical
experiments described in this paper demonstrate that the current technique and its 
computer implementation are quite adequate to deal with substantial real-life prob- 
lems. To deal with the stubborn cases in which the general approach does not imme- 
diately provide a solution, various heuristic solutions are outlined in this paper. The 
paper has also demonstrated another novel feature of the approach: methods of speci- 
fying confidentiality protection - and ensuring the specifications are met - are pro- 
vided. Statistical properties of the perturbation involved in the standard version of 
controlled rounding implemented are briefly discussed. Modifications of the standard 
approach are outlined to overcome any minor deficiencies in the statistical properties.



Abstract. Minimum-distance controlled perturbation is a recent family
of methods for the protection of statistical tabular data. These methods
are both efficient and versatile, since they can deal with large tables of any
structure and dimension, and in practice only need the solution of a linear
or quadratic optimization problem. The purpose of this paper is to give
insight into the behaviour of such methods through some computational 
experiments. In particular, the paper (1) illustrates the theoretical results 
about the low disclosure risk of the method; (2) analyzes the solutions 
provided by the method on a standard set of seven difficult and complex 
instances; and (3) shows the behaviour of a new approach obtained by 
the combination of two existing ones. 

Keywords: statistical disclosure control, controlled perturbation meth- 
ods, linear programming, quadratic programming. 



1 Introduction 

The safe dissemination of tabular data is one of the main concerns of national 
statistical agencies. The size and complexity of the data to be protected is con- 
tinuously increasing, which results in a need for more efficient and versatile 
protection procedures. This work deals with minimum-distance controlled per- 
turbation, a recent family of methods that meets the above requirements. 

Currently, one of the most widely used techniques in practice is cell suppression,
which is known to be an NP-hard problem [15]. Although exact mixed integer
linear programming procedures have been recently suggested [11], the main
drawback of this approach is that, due to its combinatorial nature, the solution
of very large instances (with possibly millions of cells) can result in impractical
execution times [13]. Several heuristics have also been suggested to obtain fast
approximate solutions [1, 4, 7, 10, 15]. Those approaches are based on the solution
of several network optimization subproblems. Unfortunately, although fast, they
can only be applied to certain classes of tables, e.g., two and three-dimensional,
and two-dimensional with hierarchies in one dimension.

To avoid the above shortcomings of cell suppression, alternative approaches have been
introduced. One of them is the minimum-distance controlled perturbation family
of methods. Given a set of tables to be protected, they find the closest ones
(according to some distance measure) that, guaranteeing confidentiality, minimize
the information loss. Members of that family of methods were independently
suggested in [9] (the controlled table adjustment method, which uses an $L_1$
distance) and [3] (the quadratic minimum-distance controlled perturbation, based
on $L_2$). Specialized interior-point methods [2] were used in [5] for the solution
of large-scale instances. A unified framework for those methods was presented
in [6], including a proof of their low disclosure risk.

This paper is organized as follows. Section 2 outlines the minimum-distance 
controlled perturbation framework. Section 3 illustrates the theoretical results 
about the disclosure risk of the method. Section 4 shows the behaviour of three 
particular distances on a set of seven complex instances. Finally, Section 5 reports 
the results obtained with an approach that combines the $L_1$ and $L_2$ distances.

2 The Minimum-Distance Controlled Perturbation Framework

This section only outlines the general model, and the particular formulations for
the $L_1$, $L_2$ and $L_\infty$ distances. More details can be found in [9] and [6].

Any problem instance, either with one table or a number of (linked or
hierarchical) tables, can be represented by the following elements:

– A set of cells $a_i$, $i = 1, \ldots, n$, that satisfy some linear relations $Ma = b$
  ($a$ being the vector of $a_i$'s). The method will look for the closest safe
  values $x_i$, $i = 1, \ldots, n$, according to some particular distance measure $L$,
  that satisfy the above constraints. The distance can be affected by any positive
  semidefinite diagonal metric matrix $W = \mathrm{diag}(w_1, \ldots, w_n)$.

– A lower and upper bound for each cell $i = 1, \ldots, n$, respectively
  $\underline{a}_i$ and $\overline{a}_i$, which are considered to be known by any
  attacker. If no previous knowledge is assumed for cell $i$, $\underline{a}_i = 0$
  ($\underline{a}_i = -\infty$ if $a \ge 0$ is not required) and
  $\overline{a}_i = +\infty$ can be used.

– A set $\mathcal{P} = \{i_1, i_2, \ldots, i_p\}$ of indices of confidential cells.

– A lower and upper protection level for each confidential cell $i \in \mathcal{P}$,
  respectively $lpl_i$ and $upl_i$, such that the released values satisfy either
  $x_i \ge a_i + upl_i$ or $x_i \le a_i - lpl_i$. To add the above "or" constraint to a
  mathematical model we need a binary variable $y_i$ and two extra constraints for each
  confidential cell:

  $$
  \begin{array}{ll}
  x_i \ge -S(1 - y_i) + (a_i + upl_i)\, y_i & i \in \mathcal{P}, \\
  x_i \le S y_i + (a_i - lpl_i)(1 - y_i) & i \in \mathcal{P}, \\
  y_i \in \{0, 1\} & i \in \mathcal{P},
  \end{array}
  \tag{1}
  $$

  $S$ in (1) being a large value. That results in a large combinatorial optimization
  problem which would restrict the effectiveness of the approach to small and medium
  sized problems. Therefore, in practice, we will assume that the sense of the protection
  for each confidential cell (i.e., the values of the $y_i$ variables) is fixed a priori.
  This simplifying assumption makes it possible to protect the table by the solution of a
  single continuous optimization problem. If the particular choice of protection senses
  (i.e., $y_i$ values) results in an infeasible problem, we can solve an alternative one
  by relaxing the constraints $Ma = b$ with a large penalization for possible
  perturbations in the right-hand side (see Section 5 for details).
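
The role of the binary variable in (1) can be checked numerically. The Python sketch
below uses invented values for $a_i$, $lpl_i$, $upl_i$ and $S$, and hypothetical function
names; it only illustrates that, for values of $x_i$ well inside $[-S, S]$, the pair of
big-$S$ constraints is satisfiable for some $y_i \in \{0, 1\}$ exactly when the original
"or" condition holds.

```python
def is_protected(x, a, lpl, upl):
    """The disjunctive protection requirement on a released value x."""
    return x >= a + upl or x <= a - lpl


def big_s_feasible(x, a, lpl, upl, y, S):
    """The two linear constraints of (1) for a given protection sense y."""
    lower_ok = x >= -S * (1 - y) + (a + upl) * y
    upper_ok = x <= S * y + (a - lpl) * (1 - y)
    return lower_ok and upper_ok


# Invented data: cell value 10, lower protection 2, upper protection 3.
a, lpl, upl, S = 10, 2, 3, 1000

for x in range(0, 30):
    by_model = any(big_s_feasible(x, a, lpl, upl, y, S) for y in (0, 1))
    print(x, is_protected(x, a, lpl, upl), by_model)
```

Fixing $y_i$ in advance, as done in the rest of the paper, keeps only one of the two
branches and so removes the combinatorial part of the model.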

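The general minimum-distance controlled perturbation method, using some
$L$ distance, can be formulated as the following optimization problem:

$$
\begin{array}{lll}
\min_{x} & \|x - a\|_L & \\
\text{subject to} & Mx = b & \\
& \underline{a}_i \le x_i \le \overline{a}_i & i = 1, \ldots, n \\
& x_i \le a_i - lpl_i \ \text{ or } \ x_i \ge a_i + upl_i & i \in \mathcal{P}.
\end{array}
\tag{2}
$$

The general problem (2) can also be formulated in terms of deviations or
perturbations from the current cell values. Defining $z_i = x_i - a_i$,
$i = 1, \ldots, n$, (2) can be transformed to

$$
\begin{array}{lll}
\min_{z} & \|z\|_L & \\
\text{subject to} & Mz = 0 & \\
& \underline{z}_i \le z_i \le \overline{z}_i & i = 1, \ldots, n \\
& z_i \le -lpl_i \ \text{ or } \ z_i \ge upl_i & i \in \mathcal{P},
\end{array}
\tag{3}
$$

where $z \in \mathbb{R}^n$ is the vector of deviations,
$\underline{z}_i = \underline{a}_i - a_i \le 0$ and
$\overline{z}_i = \overline{a}_i - a_i \ge 0$.
A benefit of (3) is that it can be solved without releasing the confidential data
vector $a$.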

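Using the $L_1$ distance, and after some manipulation, (3) can be written as

$$
\begin{array}{lll}
\min_{z^+, z^-} & \sum_{i=1}^{n} w_i (z_i^+ + z_i^-) & \\
\text{subject to} & M(z^+ - z^-) = 0 & \\
& 0 \le z_i^+ \le \overline{z}_i & i = 1, \ldots, n \\
& 0 \le z_i^- \le -\underline{z}_i & i = 1, \ldots, n \\
& \left\{ \begin{array}{l} z_i^+ \ge upl_i \\ z_i^- = 0 \end{array} \right\}
  \ \text{ or } \
  \left\{ \begin{array}{l} z_i^- \ge lpl_i \\ z_i^+ = 0 \end{array} \right\}
  & i \in \mathcal{P},
\end{array}
\tag{4}
$$

$z^+$ and $z^-$ being the vectors of positive and negative deviations in absolute value.
For $L_2$, (3) is

$$
\begin{array}{lll}
\min_{z} & \sum_{i=1}^{n} w_i z_i^2 & \\
\text{subject to} & Mz = 0 & \\
& \underline{z}_i \le z_i \le \overline{z}_i & i = 1, \ldots, n \\
& z_i \le -lpl_i \ \text{ or } \ z_i \ge upl_i & i \in \mathcal{P}.
\end{array}
\tag{5}
$$

Finally, for $L_\infty$, the general model (3) can be formulated as

$$
\begin{array}{lll}
\min_{z^+, z^-} & z_{\mathcal{P}} + z_{\overline{\mathcal{P}}} & \\
\text{subject to} & M(z^+ - z^-) = 0 & \\
& 0 \le z_i^+ \le \overline{z}_i & i = 1, \ldots, n \\
& 0 \le z_i^- \le -\underline{z}_i & i = 1, \ldots, n \\
& \left\{ \begin{array}{l} z_i^+ \ge upl_i \\ z_i^- = 0 \end{array} \right\}
  \ \text{ or } \
  \left\{ \begin{array}{l} z_i^- \ge lpl_i \\ z_i^+ = 0 \end{array} \right\}
  & i \in \mathcal{P} \\
& z_{\mathcal{P}} \ge w_i (z_i^+ + z_i^-) & i \in \mathcal{P} \\
& z_{\overline{\mathcal{P}}} \ge w_i (z_i^+ + z_i^-) & i \notin \mathcal{P},
\end{array}
\tag{6}
$$

$z_{\mathcal{P}}$ and $z_{\overline{\mathcal{P}}}$ being extra variables that store the
maximum deviation for, respectively, the sensitive and nonsensitive cells.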

An appropriate choice for the weights in (4)–(6) is $w_i = 1/a_i$, making the
deviations relative to the cell value. These weights will be used in the
computational results of the paper. Problem (4) is a fixed version of the controlled
tabular adjustment suggested in [9]. $L_2$ provides the smallest optimization problem,
although it is quadratic. $L_1$ and $L_\infty$ provide linear problems, with a larger
number of variables and constraints. Effective approaches for the solution of (4)–(6)
were discussed in [6].
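
To make the three objective functions concrete, the following Python sketch evaluates
them for a small vector of deviations with weights $w_i = 1/a_i$. The cell values, the
deviations and the function names are invented for illustration; for $L_\infty$ the
sketch returns the two maxima (sensitive and nonsensitive cells) separately, as described
for the extra variables above.

```python
def weighted_l1(z, w):
    """Objective of the L1 model: sum of weighted absolute deviations."""
    return sum(wi * abs(zi) for zi, wi in zip(z, w))


def weighted_l2(z, w):
    """Objective of the L2 model: sum of weighted squared deviations."""
    return sum(wi * zi * zi for zi, wi in zip(z, w))


def weighted_linf_parts(z, w, sensitive):
    """Maximum weighted deviation over sensitive and over nonsensitive cells."""
    sens = max((w[i] * abs(z[i]) for i in sensitive), default=0.0)
    rest = max(
        (w[i] * abs(z[i]) for i in range(len(z)) if i not in sensitive),
        default=0.0,
    )
    return sens, rest


# Invented example: three cells, the first one sensitive.
a = [10.0, 20.0, 40.0]
z = [5.0, -4.0, -1.0]
w = [1.0 / ai for ai in a]          # deviations relative to the cell value

print(weighted_l1(z, w))
print(weighted_l2(z, w))
print(weighted_linf_parts(z, w, sensitive={0}))
```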

3 Illustrating the Disclosure Risk of the Method 

The theoretical results about the disclosure risk of the method were presented 
in [6]. This section summarizes them, and illustrates the low disclosure risk of 
the method through an example. 

To retrieve the original table, the attacker should compute the deviations
applied by solving the optimization problem (3). In practice the only term known
by the attacker is the $M$ matrix provided by the table structure. However, assume
the attacker has partial information, $upl_i$, $i \in \mathcal{P}$, being the only
unknown terms (without loss of generality we consider all the protection senses were
"upper"). The problem to be solved to disclose the deviations is then

$$
\begin{array}{lll}
\min_{z'} & \|z'\|_L & \\
\text{subject to} & Mz' = 0 & \\
& z'_i \ge upl_i + \epsilon_i & i \in \mathcal{P},
\end{array}
\tag{7}
$$

$upl_i + \epsilon_i$ being the approximate values used by the attacker to obtain the
approximate deviations $z'$. The protection of the table thus depends on how sensitive
the solution $z'^*$ is to possible small $\epsilon_i$ values. This relation is explained
by the next proposition [6]:

Proposition 1. If $z'^*(\epsilon) \in \mathbb{R}^n$ is the solution of (7) for a
particular vector of $\epsilon = (\epsilon_1, \ldots, \epsilon_{|\mathcal{P}|})$ values,
and $\mu \in \mathbb{R}^{|\mathcal{P}|}$ is the Lagrange multipliers vector of the
bounds of $z'$ in (7) for $\epsilon = 0$ (i.e., the multipliers obtained when protecting
the table), then

$$
\nabla_{\epsilon} \|z'^*(\epsilon)\|_L \,\big|_{\epsilon = 0} = \mu. \tag{8}
$$



 10(3)   15      11      9     |  45
  8      10      12(4)  15     |  45
 10      12      11(2)  13(5)  |  46
 28      37      34     37     | 136
                (a)

  7       0      -6     -1     |   0
  0       0       4     -4     |   0
 -7       0       2      5     |   0
  0       0       0      0     |   0
                (b)

  9       0      -8     -1     |   0
  0       0       5     -5     |   0
 -9       0       3      6     |   0
  0       0       0      0     |   0
                (c)

 10       4     -11     -3     |   0
  0       0       6     -6     |   0
-10      -4       5      9     |   0
  0       0       0      0     |   0
                (d)


Fig. 1. Example of sensitivity of the method to changes in the protection levels. (a)
Original data $a$ to be protected. Sensitive cells are in boldface, and upper protection
levels are given in brackets. (b) Optimal deviations $z$ computed with the $L_1$
distance, weights $w_i = 1$, and inactive bounds $\underline{a}_i = 0$ and
$\overline{a}_i = \infty$ for all the internal cells. Marginal cells were fixed. The
Lagrange multipliers of the bounds $z_i \ge upl_i$ for the sensitive cells are
$\mu_{11} = 0$, $\mu_{23} = 2$, $\mu_{33} = 4$ and $\mu_{34} = 4$. The objective function
(the sum of deviations in absolute value) is 36. (c) and (d) Deviations $z'$ and $z''$
computed by the attacker using approximate protection levels with errors
$\epsilon_{11} = \epsilon_{23} = \epsilon_{33} = \epsilon_{34} = 1$, and
$\epsilon_{11} = 1$, $\epsilon_{23} = 2$, $\epsilon_{33} = 3$, $\epsilon_{34} = 4$,
respectively. The objective functions are respectively 46 and 68, which satisfy (9).



Moreover, for the $L_1$ and $L_\infty$ distances, problem (7) is linear, and,
for small enough vectors $\epsilon = (\epsilon_1, \ldots, \epsilon_{|\mathcal{P}|})$,
(8) can be recast as

$$
\|z'^*(\epsilon)\|_L - \|z^*\|_L = \sum_{i \in \mathcal{P}} \mu_i \epsilon_i, \tag{9}
$$

$z^*$ being the deviations used to protect the table.

To illustrate the above result, consider the example of Figure 1. Table (a)
shows the original data to be protected. Sensitive cells appear in boldface, and
their upper protection levels $upl_i$ are given in brackets. Using the $L_1$ distance,
weights $w_i = 1$, and bounds $\underline{a}_i = 0$ and $\overline{a}_i = \infty$ for all the internal cells, the
optimal deviations computed are shown in Table (b). The objective function
value is $\|z\|_{L_1} = \sum_i |z_i| = 36$. The Lagrange multipliers of the constraints
$z_i \ge upl_i$ for the sensitive cells are $\mu_{11} = 0$, $\mu_{23} = 2$, $\mu_{33} = 4$ and $\mu_{34} = 4$.
Since bounds $\underline{a}_i = 0$ are inactive in the solution, the attacker can use (7) to
disclose the deviations of Table (b). If, for instance, the attacker can adjust all
the original $upl_i$ protection levels except that of cell $a_{11}$ (in this case, if $e_{11} < 4$,
$e_{23} = e_{33} = e_{34} = 0$), then from (9), and since $\mu_{11} = 0$, a solution with the same
objective function as that of Table (b) (i.e., 36), and possibly with the same
deviations, will be obtained. However, if all the protection levels are adjusted with
errors, a different solution will be computed. For instance, if problem (7) is
solved with $e_{11} = e_{23} = e_{33} = e_{34} = 1$, the deviations $z'$ obtained are those of
Table (c). The objective function (i.e., sum of deviations) is 46, which satisfies
(9): $46 - 36 = 1\mu_{11} + 1\mu_{23} + 1\mu_{33} + 1\mu_{34}$. If problem (7) is solved with slightly
larger values $e_{11} = 1$, $e_{23} = 2$, $e_{33} = 3$, $e_{34} = 4$, the deviations $z''$ obtained
are shown in Table (d). Again, the objective function, 68, satisfies (9): $68 - 36 =
1\mu_{11} + 2\mu_{23} + 3\mu_{33} + 4\mu_{34}$.
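The arithmetic behind relation (9) in this example can be checked with a few lines of Python. This is only a sketch: the multipliers and error vectors are the ones quoted in the text, while the dictionary keys and the function name `predicted_increase` are illustrative choices, not part of the original work.

```python
# Lagrange multipliers of the constraints z_i >= upl_i reported for Figure 1.
mu = {"11": 0, "23": 2, "33": 4, "34": 4}


def predicted_increase(errors):
    """Right-hand side of (9): sum over sensitive cells of mu_i * e_i."""
    return sum(mu[cell] * e for cell, e in errors.items())


base_objective = 36  # objective value of Table (b)

# Error vectors used by the attacker for Tables (c) and (d).
errors_c = {"11": 1, "23": 1, "33": 1, "34": 1}
errors_d = {"11": 1, "23": 2, "33": 3, "34": 4}

objective_c = base_objective + predicted_increase(errors_c)
objective_d = base_objective + predicted_increase(errors_d)
print(objective_c, objective_d)
```

The two printed values can be compared with the objective values 46 and 68 reported for Tables (c) and (d).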

Note that $\|(\mu_i)_{i \in \mathcal{P}}\|$ (the norm of the Lagrange multipliers of constraints
$z_i \ge upl_i$ for the sensitive cells) can be used as an indicator of the protection of
Fig. 2. Example of alternative solutions with complete information by the attacker. The
original data to be protected are those of Table (a) of Figure 1. Sensitive cells are
in boldface, and upper protection levels are given in brackets. (a) and (b) Alternative
solutions $z_{L_1}$ and $z'_{L_1}$, computed with two different linear programming solvers, using
the $L_1$ distance, weights $w_i = 1$, and bounds $\underline{a}_i = 0$ and $\overline{a}_i = \infty$ for all the internal cells.
Marginal cells were fixed. The objective function - the sum of deviations in absolute
value - of both solutions is 36. (c) Unique solution $z_{L_2}$ for the $L_2$ distance, again with
weights $w_i = 1$, and bounds $\underline{a}_i = 0$ and $\overline{a}_i = \infty$ for all the internal cells. The 2-norm
of the deviations vector is 12.12.



the table. In theory, the larger this value, the more difficult it is for an attacker
to retrieve the original data. Real tables, with a large number of sensitive cells,
will often have a high $\|(\mu_i)_{i \in \mathcal{P}}\|$ value, and are thus well protected.

In some cases, even if the attacker has complete information, the right
perturbations cannot be disclosed [6]:

Proposition 2. Assume the attacker knows all the terms of problem (7). If the
$L_2$ distance is used, the solution of that problem will provide the deviations used
to protect the table. However, for $L_1$ or $L_\infty$, the attacker can obtain alternative
deviations.

For instance, Tables (a) and (b) of Figure 2 show two alternative solutions
with the $L_1$ distance for the data of Table (a) of Figure 1. They were obtained
with two different implementations of the simplex algorithm, using weights $w_i = 1$,
and bounds $\underline{a}_i = 0$ and $\overline{a}_i = \infty$ for all the internal cells. Marginal cells
were fixed. The sum of deviations is 36 in both solutions. Table (c) of Figure
2 shows, for the same data, the unique solution for the $L_2$ distance. Since $L_2$
involves a quadratic function, the solution attempts to distribute the deviations
among all the cells, obtaining a non-integer solution (valid for magnitude tables).
Proposition 2 means that $L_1$ and $L_\infty$ are a bit safer when the attacker knows all
the terms of (7), which in practice amounts to the attacker knowing the
original data (and is thus very unlikely). Therefore, in practice, it can be concluded
that the three distances have the same low disclosure risk.
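The contrast stated in Proposition 2 can be seen on a toy case that is much smaller than Figure 2. The sketch below is an illustration under stated assumptions, not the tables of the paper: it takes a hypothetical one-dimensional table $a_1 + a_2 = a_3$ with unit weights, fixes the marginal deviation at a protection level of 4, and scans a coarse grid of ways to split that deviation between the two internal cells.

```python
def l1_cost(z):
    """Sum of absolute deviations (L1 objective with unit weights)."""
    return sum(abs(v) for v in z)


def l2_cost(z):
    """Sum of squared deviations (L2 objective with unit weights)."""
    return sum(v * v for v in z)


# Hypothetical candidates (z1, z2, z3) with z1 + z2 = z3 = 4, on a grid of
# step 0.5 (an assumption of this sketch; halves are exact in floating point).
candidates = [(t / 2, 4 - t / 2, 4.0) for t in range(0, 9)]

best_l1 = min(l1_cost(z) for z in candidates)
l1_optima = [z for z in candidates if l1_cost(z) == best_l1]

best_l2 = min(l2_cost(z) for z in candidates)
l2_optima = [z for z in candidates if l2_cost(z) == best_l2]

print(len(l1_optima), "L1 optima with cost", best_l1)
print(len(l2_optima), "L2 optimum:", l2_optima)
```

Every split has the same $L_1$ cost, so a linear programming solver may return any of them, whereas the quadratic objective singles out the even split.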



4 Computational Comparison 

For the computational comparison of models (4-6) (i.e., $L_1$, $L_2$ and $L_\infty$) we
used the seven most complex instances of CSPLIB. CSPLIB is the only currently
available set of instances for tabular data protection [11]. It can be freely
Table 1. Properties of the seven complex instances.

Name     Dimensions        Size                    n   |P|      m  N.coef
bts4     4D, hierarchical  54,54,4,4           36570  2260  36310  136912
hier13   3D, hierarchical  13,13,13             2020   112   3313   11929
hier16   3D, hierarchical  16,16,16             3564   224   5484   19996
nine12   9D, linked        10,6,6,6,6,6,6,6,6  10399  1178  11362   52624
nine5d   9D, linked        4,29,3,4,5,6,5,4,5  10733  1661  17295   58135
ninenew  9D, linked        10,6,6,6,6,6,6,6,6   6546   858   7340   32920
two5in6  6D, linked        6,4,16,4,4,4         5681   720   9629   34310

obtained from http://webpages.ull.es/users/casc/#CSPlib: . These seven
instances were also the choice in [8] and are challenging for other approaches,
such as cell suppression. As shown below, they can be solved in a few seconds with
the minimum-distance approach. Table 1 provides their main features: identifier
(column “Name”), number of dimensions and structure - linked or hierarchical
- (column “Dimensions”), size for each dimension (column “Size”), number of
total cells and sensitive cells (columns “n” and “|P|”, respectively), number of
constraints (column “m”), and number of coefficients of the M matrix (column
“N.coef”). The structure and size information was obtained from [8].

Problems (4-6) were implemented using the AMPL modelling language [12]
and CPLEX 8.0 [14]. All runs were carried out on a notebook with a 1.8 GHz
processor and 512 MB of RAM. For $L_2$ we used the primal-dual interior-point algorithm
[16], which can be considered the most efficient choice. $L_1$ and $L_\infty$ were solved
with the two best linear programming algorithms: the simplex method and the
primal-dual interior-point method. Although the optimal objective function is
the same, the two algorithms can provide different solutions. In this work we used
those of the simplex method, which, in practice, provided better deviations.

For each of the three distances, Table 3 of Appendix A shows the following
information. Row “CPU” gives the CPU time in seconds for each algorithm. Rows
“Abs. dev.” provide the mean (columns “mean”), standard deviation (columns
“std”) and maximum (columns “max.”) of the absolute deviations (i.e., $|z_i|$), for
all the cells (row “all”), for the sensitive cells (row “∈P”), and for the
nonsensitive cells (row “∉P”). Similar information is provided for the percentage
absolute deviations (i.e., $100|z_i|/a_i$) in rows “Perc. dev.”. Finally, rows “2-norm”
report the 2-norm of the deviations (i.e., $\|z\|_2$), again for sensitive, nonsensitive,
and all the cells.
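The statistics listed above are simple to compute from a vector of deviations. The following Python sketch shows one way to do so; the function name `deviation_summary` and the use of the population standard deviation are assumptions of this sketch (the text does not say which variant of the standard deviation was used), and the input values are the small one-dimensional example discussed in Section 5.

```python
import math
import statistics


def deviation_summary(z, a):
    """Summarize deviations z for original cell values a (all a_i nonzero).

    Returns mean/std/max of |z_i| and of 100*|z_i|/a_i, plus the 2-norm of z.
    The standard deviation is the population one (an assumption).
    """
    abs_dev = [abs(zi) for zi in z]
    perc_dev = [100.0 * abs(zi) / ai for zi, ai in zip(z, a)]

    def summarize(values):
        return {
            "mean": statistics.mean(values),
            "std": statistics.pstdev(values),
            "max": max(values),
        }

    return {
        "abs": summarize(abs_dev),
        "perc": summarize(perc_dev),
        "two_norm": math.sqrt(sum(zi * zi for zi in z)),
    }


# Deviations (4, 0, 4) applied to cell values (12, 8, 20).
summary = deviation_summary([4.0, 0.0, 4.0], [12.0, 8.0, 20.0])
print(summary)
```

Restricting `z` and `a` to the sensitive or to the nonsensitive cells before the call would give the corresponding “∈P” and “∉P” rows.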

Looking at Table 3 we see that most of the optimization problems were solved
to optimality in a few seconds on a standard personal computer. $L_\infty$ provides
the slowest executions, due to the large number of constraints considered in
(6). $L_2$, solved through a quadratic interior-point solver, was always the most
efficient choice (except for the smallest instance hier13). In most instances the
solution time of $L_2$ was about half the time of the second fastest option. This
is because, first, the complexity of solving a quadratic separable optimization
problem is the same as that of a linear one, if we use an interior-point algorithm;
and second, problem (4) involves twice as many variables as (5). The solution
times obtained with the interior-point algorithm, for the three objectives,
can be improved further using specialized solvers that exploit the table structure
[2,5].

For the absolute deviations, $L_2$ provides the lowest means and, mainly, the
lowest standard deviations. Such low standard deviations are not surprising,
since $L_2$, due to its quadratic nature, attempts to evenly distribute the required
deviations among all the cells. As for the other two distances, $L_\infty$ provided better
absolute deviations than $L_1$, except for instances hier13 and hier16. That was, a
priori, an unexpected result, since only two cells appear in the objective function
of (6), whereas all the perturbations are considered in (4). The distribution of
the absolute deviations (not reported in the tables) showed that $L_1$ provided the
greatest number of cells with small deviations.

For the percentage deviations, $L_1$ must clearly provide the best mean values,
since its objective function is exactly the sum of percentage absolute deviations.
However, $L_2$ provides similar mean percentage deviations, and, for most
instances, slightly better standard deviations. $L_\infty$ provided worse means
and standard deviations, but, as a consequence of its objective function, the
lowest maximum values.

Finally, the lowest 2-norms of the deviations vector are provided in all the
instances by $L_2$. This is a consequence of $L_2$ being the only quadratic objective
of the three tested. Except for instance hier13, $L_\infty$ always provides deviations
with better 2-norms than $L_1$.

From the above comments, we can conclude that $L_1$ provides the best results
when a first-order comparison measure, such as the mean percentage deviation, is
considered. However, when a second-order measure is used, such as the 2-norm of the
deviations or the standard deviation of the percentage deviations, $L_2$ seems to
be the best choice. This is an immediate result of the objective functions
(linear or quadratic) of the respective optimization problems. It suggests that
a method combining $L_1$ and $L_2$ could provide fairly good values for both the
first-order and second-order comparison measures. This alternative is explored in the next section.



5 Combining the $L_1$ and $L_2$ Distances



The optimization problem that results from the combination of the $L_1$ and $L_2$
distances can be written in a general form as

$$
\begin{array}{rl}
\displaystyle \min_{z^+, z^-} & \displaystyle \omega_1 \sum_{i=1}^{n} w_{1,i} (z_i^+ + z_i^-) + \omega_2 \sum_{i \in S} w_{2,i} (z_i^+ + z_i^-)^2 \\
\mbox{subject to} & M(z^+ - z^-) = 0 \\
& 0 \le z_i^+ \le \overline{z}_i, \quad 0 \le z_i^- \le -\underline{z}_i, \quad i = 1, \ldots, n \\
& \left\{ z_i^+ \ge upl_i,\; z_i^- = 0 \right\} \mbox{ or } \left\{ z_i^- \ge lpl_i,\; z_i^+ = 0 \right\}, \quad i \in \mathcal{P},
\end{array} \qquad (10)
$$

$z^+$ and $z^-$ being the vectors of positive and negative deviations in absolute value,
$\omega_1$ and $\omega_2$ weights for the overall contribution to the objective function of,
respectively, the linear and quadratic terms, $S$ a subset of cells affected by $L_2$, and
$w_{1,i}$ and $w_{2,i}$ cell weights for, respectively, $L_1$ and $L_2$. This formulation is general
enough to accommodate several situations. For instance, it provides an always
feasible problem if we apply the $L_1$ and $L_2$ terms to, respectively, the internal
and marginal cells (i.e., $S$ is the set of marginal cells), with a large penalization
for changes in marginal values (i.e., $w_{2,i} \gg 0$).

Fig. 3. Solutions of the $L_1$ and $L_2$ distances for the one-dimensional table $a_1 + a_2 =
a_3$, imposing a perturbation $z_3 \ge 4$ for the marginal cell. Point $(a_1, a_2) = (12, 8)$
corresponds to the original internal cell values. The other eleven points are the solutions
obtained with the objective function of (10) using $\omega_1 = \omega$ and $\omega_2 = 1 - \omega$, for $\omega =
0, 0.1, 0.2, \ldots, 0.9, 1$, which combines the $L_1$ and $L_2$ distances through the weight factor
$\omega$. The $L_2$ solution (computed with $\omega = 0$) is closer to $(a_1, a_2)$, but the $L_1$ point ($\omega = 1$)
preserves the value of $a_2$.

Before presenting results for the seven instances of Section 4, we first illustrate
the behaviour of (10) on the small example of Figure 3. The table considered
is $a_1 + a_2 = a_3$, with $a_1 = 12$ and $a_2 = 8$. We imposed $z_1 + z_2 = z_3$ and
$z_3 \ge 4$, i.e., an upper protection level of 4 is forced for the marginal sensitive
cell. We set $S = \{1, 2, 3\}$ (i.e., the three cells appear in the quadratic term of
the objective function), and $\omega_1 = \omega$, $\omega_2 = 1 - \omega$, $\omega \in [0, 1]$ being a predefined
parameter. For $\omega = 1$ and $\omega = 0$ the combined objective of (10) corresponds to
the $L_1$ and $L_2$ distances, respectively. Using $w_{1,i} = 1/a_i$ the optimal solution
obtained with $L_1$ is $z_1 = 4$, $z_2 = 0$ and $z_3 = 4$. With the same weights $w_{2,i} =
1/a_i$, the optimal solution provided by $L_2$ is $z_1 = 2.4$, $z_2 = 1.6$ and $z_3 = 4$.
If integer values were required, the $z_1$ and $z_2$ values could be rounded through
some heuristic postprocess (in that case the most reasonable choice would be
$z_1 = 2$ and $z_2 = 2$). Figure 3 shows the perturbed internal cell values obtained
for $\omega = 0, 0.1, \ldots, 0.9, 1$, and the original ones $(a_1, a_2)$. Clearly, the $L_2$ point
is closer to (12, 8), but the $L_1$ solution preserves the value of cell $a_2$. This is
consistent with the results of Section 4. The combined $L_{1-2}$ objective provides
solutions on a curve joining the $L_1$ and $L_2$ points. Because of the larger costs of
the quadratic term, the optimal solution was only far enough from the $L_2$ point
for $\omega = 0.8$ and $\omega = 0.9$.
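The small example can be reproduced numerically without an optimization package. The sketch below is a simplified illustration, not the AMPL/CPLEX setup used for the experiments: it assumes that the marginal deviation sits exactly at its protection level ($z_3 = 4$), that $z_1$ and $z_2$ are nonnegative, and that $a_3 = a_1 + a_2 = 20$; the remaining one-dimensional convex problem in $z_1$ is then solved by ternary search. The function names are hypothetical.

```python
A = (12.0, 8.0, 20.0)  # cell values a1, a2 and a3 = a1 + a2
Z3 = 4.0               # marginal deviation fixed at its protection level


def combined_objective(z1, omega):
    """Objective of (10) with weights 1/a_i, omega_1 = omega, omega_2 = 1 - omega."""
    z = (z1, Z3 - z1, Z3)
    linear = sum(abs(zi) / ai for zi, ai in zip(z, A))
    quadratic = sum(zi * zi / ai for zi, ai in zip(z, A))
    return omega * linear + (1.0 - omega) * quadratic


def solve(omega, iterations=200):
    """Ternary search for z1 in [0, Z3]; valid because the objective is convex."""
    lo, hi = 0.0, Z3
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if combined_objective(m1, omega) <= combined_objective(m2, omega):
            hi = m2
        else:
            lo = m1
    z1 = (lo + hi) / 2.0
    return z1, Z3 - z1, Z3


for k in range(11):
    omega = k / 10.0
    z1, z2, z3 = solve(omega)
    print(f"omega={omega:.1f}  perturbed cells=({A[0] + z1:.3f}, {A[1] + z2:.3f})")
```

Under these assumptions the minimizer moves away from the $L_2$ point only slowly as $\omega$ grows, which agrees with the observation that the curve departs visibly from it only for the largest weights.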
Table 2. Results for the seven complex instances, for $L_1$, $L_2$ and $L_{1-2}$.

                     L1                      L2                     L1-2
name          CPU   %Dev.  2-norm     CPU   %Dev.  2-norm     CPU   %Dev.  2-norm
bts4         16.5    0.74   18243    11.5    0.83    7912    45.0    0.76   10217
hier13        3.3    0.81    2609     3.8    0.87    2149     7.1    0.82    2306
hier16       19.9    0.83    3203    17.1    0.90    2706    31.0    0.84    2845
nine12      382.1    1.35    5840    18.3    1.53    4878    43.7    1.38    5234
nine5d      126.7    1.67    8316    20.4    1.90    5468    30.9    1.72    5845
ninenew      27.0    1.55    5448    11.1    1.76    4444    26.3    1.56    4731
two5in6      13.6    1.46    4917       9    1.65    3749    16.8    1.50    4045

In the computational results of this section we used $S = \{1, \ldots, n\}$ (i.e., all
the cells are involved in the quadratic term) and $w_{1,i} = w_{2,i} = 1/a_i$, $i = 1, \ldots, n$.
According to the previous small example, we also set $\omega_1 = 0.99$ and $\omega_2 = 0.01$.
Table 2 shows the results obtained with $L_1$, $L_2$ and the combined $L_{1-2}$ objective.
For each distance, the execution time (columns “CPU”), average percentage
deviation for all the cells (columns “%Dev.”), and 2-norm of the deviations vector
(columns “2-norm”) are provided. Executions were carried out on the same
hardware and with the same software (i.e., AMPL+CPLEX 8.0) as in Section 4.
The results reported for $L_1$ were obtained with the simplex method, while the
quadratic interior-point algorithm was used for $L_2$ and $L_{1-2}$. Looking at Table 2
we see that the combined $L_{1-2}$ distance provides average percentage deviations close
to those of $L_1$, while the 2-norm has been significantly reduced. As expected,
the combined $L_{1-2}$ distance inherited the good properties of $L_1$ and $L_2$.

6 Conclusions 

As shown by the computational experiments of this work, the minimum-distance
approach is efficient, versatile and safe. The three methods tested, for $L_1$, $L_2$
and $L_\infty$, provided different patterns of deviations, each of them with a clear
behaviour. As done in the paper with $L_1$ and $L_2$, it is possible to combine them
in a new approach with the good features of the original methods.

One of the fields of research to be explored deals with the optimization
solvers. In a static environment, the final goal might be the protection, in a single
run, of all the tables derived from the same microdata. The resulting problem
is huge. In a dynamic environment, the goal would be the online protection of
particular tables (e.g., obtained from end-user queries to a data warehouse).
Speed is instrumental in that case. In both situations, we may need highly
efficient implementations of the optimization methods used in this work, which
exploit the problem structure. Some steps have already been taken in this
direction for large (i.e., one million cells) three-dimensional tables and $L_2$ [5], where a
specialized implementation was two orders of magnitude faster than the CPLEX
8.0 solver. Extending those achievements to general tables is part of the future
work to be done.
Appendix

A Tables with Results for the Seven Instances

Table 3. Results for the seven complex instances, for $L_1$, $L_2$ and $L_\infty$.



Results for the bts4 instance

                           L1                      L2                      L∞
                     Simplex  Int. Point              Int. Point     Simplex  Int. Point
CPU                    16.46        39.7                   11.45     1594.69      207.02

                    mean     std    max.    mean     std    max.    mean     std    max.
Abs. dev.   all     33.9    89.2  4483.0    24.9    33.0   795.9    30.1    49.0   947.8
            ∈P      56.0    32.2   155.0    56.0    32.2   155.0    57.0    32.6   168.5
            ∉P      32.4    91.5  4483.0    22.9    32.0   795.9    28.4    49.4   947.8
Perc. dev.  all     0.74    1.97   11.11    0.84    1.95   20.23    1.10    2.36   11.11
            ∈P      7.27    2.60   11.11    7.27    2.59   11.11    7.46    2.61   11.11
            ∉P      0.31    0.83   11.03    0.42    0.84   20.23    0.68    1.63   11.03
2-norm      all                  18243.0                  7912.0                 10997.2
            ∈P                    3072.3                  3070.3                  3120.5
            ∉P                   17982.4                  7292.0                 10545.2

Results for the hier13 instance

                           L1                      L2                      L∞
                     Simplex  Int. Point              Int. Point     Simplex  Int. Point
CPU                     3.25        6.86                    3.83        5.85       35.23

                    mean     std    max.    mean     std    max.    mean     std    max.
Abs. dev.   all     37.8    44.1   344.0    33.9    33.7   313.4    52.0    58.2   463.7
            ∈P      55.6    28.0    97.0    55.2    27.8    97.0    59.0    27.8    97.0
            ∉P      36.8    44.6   344.0    32.7    33.6   313.4    51.5    59.4   463.7
Perc. dev.  all     0.81    1.72    9.97    0.87    1.95   45.84    1.04    1.91    9.97
            ∈P      6.20    2.17    9.97    6.18    2.19    9.97    6.65    2.37    9.97
            ∉P      0.49    1.02    8.28    0.56    1.42   45.84    0.71    1.25    8.28
2-norm      all                   2609.6                  2149.3                  3504.9
            ∈P                     658.5                   654.1                   689.3
            ∉P                    2525.1                  2047.4                  3436.4
Table 3. (Continued).



Results for the hier16 instance

                           L1                      L2                      L∞
                     Simplex  Int. Point              Int. Point     Simplex  Int. Point
CPU                    19.85       28.36                   17.19       66.52      136.86

                    mean     std    max.    mean     std    max.    mean     std    max.
Abs. dev.   all     35.8    40.0   280.5    33.4    30.6   258.3    36.8    36.6   300.9
            ∈P      48.3    27.4   131.0    48.3    27.4   131.0    48.7    27.4   131.0
            ∉P      34.9    40.6   280.5    32.4    30.6   258.3    36.0    37.0   300.9
Perc. dev.  all     0.83    1.84   10.00    0.90    1.81   10.00    1.13    2.05   10.00
            ∈P      6.89    2.38   10.00    6.89    2.38   10.00    7.04    2.41   10.00
            ∉P      0.43    0.78    7.59    0.50    0.75    7.59    0.73    1.26    7.59
2-norm      all                   3203.5                  2706.3                  3098.4
            ∈P                     830.2                   830.2                   836.8
            ∉P                    3094.1                  2575.9                  2983.2





Results for the nine12 instance

                           L1                      L2                      L∞
                     Simplex  Int. Point              Int. Point     Simplex  Int. Point
CPU                   382.13       47.38                   18.29      727.28       338.8

                    mean     std    max.    mean     std    max.    mean     std    max.
Abs. dev.   all     36.3    44.3   490.9    34.6    33.0   377.4    32.6    36.5   268.0
            ∈P      51.7    28.3   154.0    51.6    28.2   154.0    52.1    28.2   154.0
            ∉P      34.4    45.6   490.9    32.4    33.0   377.4    30.1    36.7   268.0
Perc. dev.  all     1.35    2.34   12.55    1.53    2.32   25.43    1.74    2.64   10.00
            ∈P      6.71    2.38   10.00    6.70    2.39   11.97    6.82    2.40   10.00
            ∉P      0.67    1.15   12.55    0.87    1.23   25.43    1.09    1.85    8.95
2-norm      all                   5840.1                  4878.1                  4988.1
            ∈P                    2022.2                  2017.2                  2034.2
            ∉P                    5478.8                  4441.5                  4554.5



Results for the nine5d instance

                           L1                      L2                      L∞
                     Simplex  Int. Point              Int. Point     Simplex  Int. Point
CPU                   126.67       43.03                   20.36      784.52      137.33

                    mean     std    max.    mean     std    max.    mean     std    max.
Abs. dev.   all     41.4    68.8  1010.0    37.2    37.5   499.4    34.4    38.4   306.5
            ∈P      50.6    29.3   156.0    50.6    29.3   156.0    50.8    29.3   156.0
            ∉P      39.7    73.6  1010.0    34.7    38.3   499.4    31.4    39.1   306.5
Perc. dev.  all     1.67    2.69   10.00    1.90    2.53   10.00    2.23    3.02   10.00
            ∈P      6.83    2.42   10.00    6.83    2.42   10.00    6.87    2.42   10.00
            ∉P      0.73    1.31    9.78    1.00    1.11    9.31    1.38    2.25    8.79
2-norm      all                   8316.4                  5468.3                  5343.4
            ∈P                    2383.6                  2383.2                  2389.6
            ∉P                    7967.5                  4921.7                  4779.4







86 Jordi Castro 

Table 3. (Continued). 



Results for the ninenew instance

                           L1                      L2                      L∞
                     Simplex  Int. Point              Int. Point     Simplex  Int. Point
CPU                    27.08       24.02                   11.15      199.39      120.52

                    mean     std    max.    mean     std    max.    mean     std    max.
Abs. dev.   all     41.6    53.0   602.7    38.6    39.0   522.8    39.0    43.2   439.1
            ∈P      52.4    28.6   192.0    52.4    28.3   192.0    53.0    28.3   192.0
            ∉P      39.9    55.5   602.7    36.6    40.0   522.8    36.8    44.7   439.1
Perc. dev.  all     1.56    2.47   16.16    1.76    2.44   22.86    2.19    2.93   10.00
            ∈P      6.66    2.38   10.00    6.66    2.36   10.00    6.79    2.39   10.00
            ∉P      0.79    1.29   16.16    1.02    1.35   22.86    1.50    2.32    9.93
2-norm      all                   5447.5                  4444.3                  4708.1
            ∈P                    1749.6                  1744.5                  1759.5
            ∉P                    5158.9                  4087.6                  4366.9





Results for the two5in6 instance

                           L1                      L2                      L∞
                     Simplex  Int. Point              Int. Point     Simplex  Int. Point
CPU                    13.58       16.88                       9       83.48       86.47

                    mean     std    max.    mean     std    max.    mean     std    max.
Abs. dev.   all     38.3    52.8   530.0    35.4    34.9   340.1    38.3    39.3   281.8
            ∈P      49.1    32.0   169.0    49.1    32.0   169.0    49.7    31.8   169.0
            ∉P      36.7    55.0   530.0    33.5    34.9   340.1    36.7    40.0   281.8
Perc. dev.  all     1.46    2.49   10.00    1.65    2.40   17.88    2.08    2.81   10.00
            ∈P      6.80    2.42   10.00    6.80    2.42   10.00    6.99    2.42   10.00
            ∉P      0.69    1.23    9.69    0.90    1.17   17.88    1.37    2.04    8.44
2-norm      all                   4917.2                  3749.3                  4137.1
            ∈P                    1573.0                  1572.0                  1582.4
            ∉P                    4658.8                  3403.8                  3822.5






Balancing Quality and Confidentiality 
for Multivariate Tabular Data 



Lawrence H. Cox¹, James P. Kelly², and Rahul Patil²

¹ National Center for Health Statistics, Centers for Disease Control and Prevention
Hyattsville, MD 20782 USA
² OptTek Systems, Inc., Boulder, CO 80302 USA



Abstract. Absolute cell deviation has been used as a proxy for preserving data
quality in statistical disclosure limitation for tabular data. However, users’
primary interest is that analytical properties of the data are for the most part
preserved, meaning that the values of key statistics are nearly unchanged.
Moreover, important relationships within (additivity) and between (correlation) the
published tables should also be unaffected. Previous work demonstrated how to
preserve additivity, mean and variance for univariate tabular data. In this
paper, we bridge the gap between statistics and mathematical programming to
propose nonlinear and linear models based on constraint satisfaction to preserve
additivity and covariance, correlation, and regression coefficient between data
tables. Linear models are superior to nonlinear models owing to simplicity,
flexibility and computational speed. Simulations demonstrate that the models
perform well in terms of preserving key statistics with reasonable accuracy.

Keywords: Controlled tabular adjustment, linear programming, covariance 



1 Introduction 

Tabular data are ubiquitous. Standard forms include count data as in population and 
health statistics, concentration or percentage data as in financial or energy statistics, 
and magnitude data such as retail sales in business statistics or average daily air pollu- 
tion in environmental statistics. Tabular data remain a staple of official statistics. Data 
confidentiality was first investigated for tabular data [1, 2]. Tabular data are additive 
and thus naturally related to specialized systems of linear equations: TX = 0, where X
represents the tabular cells and T the tabular equations, the entries of T are in the set
{-1, 0, +1}, and each row of T contains precisely one -1.
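As an illustration of the system TX = 0, the following Python sketch builds T for a hypothetical 2 x 2 table with row totals, column totals and a grand total. The cell names, their ordering and the numeric values are assumptions made for this example only.

```python
# Hypothetical 2 x 2 table: internal cells x11..x22, row totals r1, r2,
# column totals c1, c2 and grand total g.
cells = ["x11", "x12", "r1", "x21", "x22", "r2", "c1", "c2", "g"]

# Each tabular equation: (cells that are summed, cell holding their total).
equations = [
    (["x11", "x12"], "r1"),
    (["x21", "x22"], "r2"),
    (["x11", "x21"], "c1"),
    (["x12", "x22"], "c2"),
    (["r1", "r2"], "g"),
    (["c1", "c2"], "g"),
]


def build_T(cells, equations):
    """One row per equation: +1 for each summand, -1 for the total, 0 elsewhere."""
    index = {name: k for k, name in enumerate(cells)}
    T = []
    for parts, total in equations:
        row = [0] * len(cells)
        for name in parts:
            row[index[name]] = 1
        row[index[total]] = -1
        T.append(row)
    return T


def is_additive(T, x):
    """True when every tabular equation holds, i.e. T x = 0."""
    return all(sum(t * v for t, v in zip(row, x)) == 0 for row in T)


T = build_T(cells, equations)
x = [12, 8, 20, 5, 15, 20, 17, 23, 40]
print(is_additive(T, x))
```

Changing a single internal cell without touching its totals makes `is_additive` fail, which is the loss of additivity that adjustment methods have to repair.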

In [3], Dandekar and Cox introduced a methodology for statistical disclosure 
limitation [4] in tabular data known as controlled tabular adjustment (CTA). 

This development was motivated by computational complexity, analytical
obstacles, and general user dissatisfaction with the prevailing methodology, complementary
cell suppression [1, 5]. Complementary suppression removes from publication all
sensitive cells - cells that cannot be published due to confidentiality concerns - and in
addition removes other, nonsensitive cells to ensure that values of sensitive cells
cannot be reconstructed or closely estimated by manipulating linear tabular relationships.
Drawbacks of cell suppression for statistical analysis include removal of otherwise
useful information and consequent difficulties analyzing tabular systems with cell
values missing not-at-random. By contrast, the CTA methodology replaces sensitive
cell values with safe values, viz., values sufficiently far from the true value. Because
the adjustments almost certainly throw the additive tabular system out of kilter, CTA
adjusts some or all of the nonsensitive cells by small amounts to restore additivity.
CTA is implemented using mathematical programming methods for which
commercial and free software are widely available.

In terms of ease-of-use, controlled tabular adjustment is unquestionably an
improvement over cell suppression. As CTA changes sensitive and other cell values,
the issue is then: can CTA be accomplished while preserving important data
analytical properties of original data?

In [6], Cox and Dandekar describe how the original CTA methodology can be
implemented with an eye towards preserving data analytic outcomes. In [7], Cox and
Kelly demonstrate how to extend the mathematical programming model for CTA to
preserve univariate properties of original data important to linear statistical models. In
this paper, we extend the Cox-Kelly paradigm to the multivariate case, ensuring that
covariances, correlations and regression coefficients between original variables are
preserved in adjusted data. Specialized search procedures, including Tabu Search [8],
can also be employed in formulations proposed in [7]. While it is easy to formulate
nonlinear programming (NLP) models to do this, such models present difficulties in
understanding and use in general statistical settings and exhibit computational
limitations. The NLP models are computationally more expensive than linear programming
(LP) models because the LP algorithms and re-optimization processes are much more
efficient than NLP algorithms. Moreover, LP solvers guarantee global
optimality, whereas NLP solvers cannot guarantee global optimality for non-convex
problems. Models presented here are based on linear programming and consequently
are easy to develop and use and are applicable to a very wide range of problem types
and sizes.

Section 2 provides a summary of the original CTA methodology of [3] and linear 
methods of [7] in the univariate case for preserving means and variances of original 
data and for ensuring high correlation between original and adjusted data. Section 3 
provides new results addressing the multivariate case, providing linear programming 
formulations that ensure covariance, correlation and regression coefficient between 
two original variables exhibited in original data are preserved in adjusted data. Sec- 
tion 4 reports computational results. Section 5 provides concluding comments. 



2 Controlled Tabular Adjustment and Data Quality 
for Univariate Data 

2.1 The Original CTA Methodology 

CTA is applicable to tabular data in any form but for convenience we focus on
magnitude data, where the greatest benefits reside. A simple paradigm for statistical
disclosure in magnitude data is as follows. A tabulation cell, denoted i, comprises k
respondents (e.g., retail clothing stores in a county) and their data (e.g., retail sales and
employment data). It is assumed that each respondent knows the identity of the other
respondents. The cell value is the total value of a statistic of interest (e.g., total retail
sales), summed over nonnegative contributions of each respondent in the cell to this
statistic. Denote the cell value $v^{(i)}$ and the respondent contributions $v_j^{(i)}$, ordered from
largest to smallest. It is possible for any respondent j to compute $v^{(i)} - v_j^{(i)}$, which
yields an upper estimate of the contribution of any other respondent. This estimate is
closest, in percentage terms, when the target is the largest respondent and j = 2. A
standard disclosure rule, the p-percent rule, declares that the cell value represents
disclosure whenever this estimate is less than (100 + p)-percent of the largest
contribution. The sensitive cells are those failing this condition.
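The p-percent rule as just described can be sketched in a few lines of Python. The function name `is_sensitive`, the example contributions, and the decision to reject cells with fewer than two respondents (a case the text does not discuss) are assumptions of this sketch.

```python
def is_sensitive(contributions, p):
    """Apply the p-percent rule to one cell.

    The second largest respondent subtracts its own contribution from the
    cell value; the cell is sensitive when that upper estimate is less than
    (100 + p) percent of the largest contribution.
    """
    if len(contributions) < 2:
        # Not covered by the description above; treated as an error here.
        raise ValueError("the rule as stated needs at least two respondents")
    ordered = sorted(contributions, reverse=True)
    cell_value = sum(ordered)
    upper_estimate = cell_value - ordered[1]
    return upper_estimate < (1.0 + p / 100.0) * ordered[0]


# Hypothetical cells: one dominated by a single respondent, one balanced.
print(is_sensitive([80.0, 15.0, 5.0], p=10))
print(is_sensitive([40.0, 35.0, 25.0], p=10))
```

In the first cell the second largest respondent can bound the largest contribution from above by 85, which is within 10 percent of the true value 80, so the cell would be flagged; in the second cell the bound of 65 is far from 40.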

We also may assume that any respondent can use public knowledge to estimate the
contribution of any other respondent to within q-percent (q > p, e.g., q = 50%). This
additional information allows the second largest to estimate $v^{(i)} - v_1^{(i)} - v_2^{(i)}$, the sum
of all contributions excluding itself and the largest, to within q-percent. This upper
estimate provides the second largest a lower estimate of $v_1^{(i)}$. The lower and upper
protection limits for the cell value equal, respectively, the minimum amount that must
be subtracted from (added to) the cell value so that these lower (upper) estimates are
at least p-percent away from the response $v_1^{(i)}$. Numeric values outside the protection
limit range of the true value are safe values for the cell. A common practice assumes
that these protection limits are equal, to $p_i$. Complementary cell suppression
suppresses all sensitive cells from publication, replacing sensitive values by variables in
the tabular system TX = 0. Because, almost surely, one or more suppressed sensitive
cell values can be estimated via linear programming to within its unsafe range, it is
necessary to suppress some nonsensitive cells until no sensitive estimates are
obtainable. This yields a mixed integer linear programming (MILP) problem, as in [9].

The original controlled tabular adjustment methodology [3] replaces each sensitive
value with a safe value. This is an improvement over complementary cell suppression
as it replaces a suppression symbol by an actual value. However, safe values are not
necessarily unbiased estimates of true values. To minimize bias, [3] replaces the true
value by either of its nearest safe values, $v^{(i)} - p_i$ or $v^{(i)} + p_i$. Because this will
almost surely throw the tabular system out of kilter, CTA adjusts nonsensitive values to
restore additivity. Because choices to adjust each sensitive value down or up are
binary, combined these steps define a MILP [10]. Dandekar and Cox [3] present
heuristics for the binary choices. The resulting linear programming relaxation is easily
solved.

A (mixed integer) linear program in itself will not assure that analytical properties 
of original and adjusted data are comparable. Cox and Dandekar [6] address these 
issues in three ways. First, sensitive values are replaced by nearest safe values to re- 
duce statistical bias. Second, lower and upper bounds (capacities) are imposed on 
changes to nonsensitive values to ensure adjustments to individual data are acceptably
small. Statistically sensible capacities would, e.g., be based on estimated measurement
error $e_i$ for each cell. Third, the linear program optimizes an overall measure
of data distortion such as minimum sum of absolute adjustments or minimum sum of 
percent absolute adjustments. The MILP model is as follows.

Assume there are n tabulation cells of which the first s are sensitive, original data
are represented by the n×1 vector a, adjusted data by $a + y^+ - y^-$, and $y = y^+ - y^-$.
The MILP of [10] corresponding to the methodology of [3] for minimizing sum of
absolute adjustments is:

$$\min \sum_{i=1}^{n} (y_i^+ + y_i^-)$$

subject to:

$$T(y) = 0 \qquad (1)$$

$$y_i^- = p_i (1 - I_i), \quad y_i^+ = p_i I_i, \quad I_i \text{ binary}, \quad i = 1, \ldots, s$$

$$0 \le y_i^-, y_i^+ \le e_i, \quad i = s+1, \ldots, n$$

If the capacities are too tight, this problem may be infeasible (lack solutions). In 
such cases, capacities on nonsensitive cells may be increased. A companion strategy, 
albeit controversial, allows sensitive cell adjustments smaller than $p_i$ in well-defined
situations. This is justified mathematically because the intruder does not know if the
adjusted value lies above or below the original value; see [6] for details. 
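
As an illustration of formulation (1), the following minimal sketch enumerates the binary choices for a deliberately tiny instance. It assumes a single additive constraint (one row whose published total is held fixed), so that the adjustments must sum to zero; the function name `solve_toy_cta` and the numbers are hypothetical and are not taken from [3] or [10]. For realistic tabular systems an LP or MILP solver would replace the closed-form inner step.

```python
from itertools import product


def solve_toy_cta(p, e):
    """Brute-force a one-constraint CTA instance (illustrative assumption).

    p: protection limits of the sensitive cells; each is moved by -p_i or +p_i.
    e: capacities of the nonsensitive cells; each may move by at most e_j.
    All cells belong to one row whose total is fixed, so adjustments sum to zero.
    Returns (cost, directions) minimizing the sum of absolute adjustments,
    or None when the capacities are too tight.
    """
    best = None
    slack = sum(e)  # largest net shift the nonsensitive cells can absorb
    for directions in product((-1, +1), repeat=len(p)):
        shift = sum(d * p_i for d, p_i in zip(directions, p))
        if abs(shift) > slack:
            continue  # infeasible: additivity cannot be restored
        # Sensitive cells always cost sum(p); absorbing the net shift
        # costs exactly |shift| when all offsets share one sign.
        cost = sum(p) + abs(shift)
        if best is None or cost < best[0]:
            best = (cost, directions)
    return best


result = solve_toy_cta(p=[10, 6, 3], e=[2, 2])
tight = solve_toy_cta(p=[10], e=[2, 2])  # capacities too tight
```

In this sketch the second call returns `None`, which corresponds to the infeasible case discussed above; enlarging the capacities `e` restores feasibility.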

The constraints used in [6] are useful. Unfortunately, choices for the optimizing 
measure are limited to linear functions of cell deviations. In the next two sections, we 
extend this paradigm in two separate directions, focusing on approaches to preserving 
mean, variance, correlation and regression coefficient between original and adjusted 
data. 

Formulation (1) is a mixed integer linear program. The integer part can be solved by
exact methods in small to medium-sized problems or via heuristics which first fix the 
integer variables and subsequently use linear programming to solve the linear pro- 
gramming relaxation [3,11]. Cox and Kelly [11] propose different heuristics to fix the 
binary variables and to reduce the number of the binary variables in order to improve 
the computational efficiency, and report good solutions in reasonable time. The re- 
mainder of this paper focuses on the problem of preserving data quality under CTA, 
and is not concerned with how the integer portion is being or has been solved. For that 
reason, for convenience we occasionally abuse terminology and refer to (1) and sub- 
sequent formulations as “linear programs”. 



2.2 Using CTA to Preserve Univariate Relationships 

We present linear programming formulations of [7] for preserving approximately 
mean, variance, correlation and regression slope between original and adjusted data 
while preserving additivity. 

Preserving mean values is straightforward. Any cell value can be held fixed by
forcing its corresponding adjustment variables $y_i^+$, $y_i^-$ to zero, viz., set each variable's
upper capacity to zero. Means are averages over sums. So, for example, to fix
the grand mean, simply fix the grand total. Or, to fix means over all or a selected set
of rows, columns, etc., in the tabular system, simply capacitate changes to the corresponding
totals to zero. To fix the mean of any set of variables for which a corresponding
variable has not been defined, incorporate a new constraint into the linear
system: $\sum (y_i^+ - y_i^-) = 0$, where the sum is taken over the set of variables of interest.

To ensure feasibility and allow a richer set of potential solutions, it is useful to allow 
adjustments to sensitive cells to vary a bit beyond nominal protection limits, using 
capacities. The corresponding MILP is: 

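$$\min c(y)$$

subject to:

$$T(y) = 0 \qquad (2)$$

$$\sum_i (y_i^+ - y_i^-) = 0$$

$$q_i (1 - I_i) \le y_i^- \le p_i (1 - I_i), \quad q_i I_i \le y_i^+ \le p_i I_i, \quad I_i \text{ binary}, \quad i = 1, \ldots, s$$

$$0 \le y_i^-, y_i^+ \le e_i, \quad i = s+1, \ldots, n$$

c(y) is used to keep adjustments close to their lower limit, e.g., $c(y) = \sum_i (y_i^+ + y_i^-)$.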

Cox and Kelly [7] demonstrate that the data quality objectives (variance, correlation
and regression slope) can be realized by forcing the covariance between original
data a and adjustments to original data y to be close to zero while preserving the corresponding
mean value(s). For variance, for any subset of cells of size t with $\bar{y} = 0$,

$$\mathrm{Var}(a + y) = (1/t) \sum_{i=1}^{t} \left( a_i + y_i - (\bar{a} + \bar{y}) \right)^2 = \mathrm{Var}(a) + (2/t) \sum_{i=1}^{t} (a_i - \bar{a}) y_i + \mathrm{Var}(y)$$

Define $L(y) = \mathrm{Cov}(a, y)/\mathrm{Var}(a)$. As $\bar{y} = 0$, then

$$L(y) = \frac{1}{t\,\mathrm{Var}(a)} \sum_{i=1}^{t} (a_i - \bar{a}) y_i,$$

so

$$\mathrm{Var}(a + y)/\mathrm{Var}(a) = 2L(y) + (1 + \mathrm{Var}(y)/\mathrm{Var}(a))$$

and

$$|\mathrm{Var}(a + y)/\mathrm{Var}(a) - 1| = |2L(y) + \mathrm{Var}(y)/\mathrm{Var}(a)|$$

Thus, relative change in variance can be minimized by minimizing the right-hand
side. As Var(y)/Var(a) is typically small, it suffices to minimize |L(y)| or at
least to reduce it to an acceptable level. This can be accomplished as follows:

a) Incorporate two new linear constraints into the system (2):

$$w \ge L(y), \quad w \ge -L(y)$$

b) Minimize w.     (3)

Using standard methods (e.g., constraining w and -w to be less than a small quantity)
we treat (3) as a set of constraints leaving us free to specify a convenient linear
objective function, such as sum of absolute adjustments, $\sum |y_i|$. In such cases, we say
that we are setting the corresponding linear functional (e.g., L(y)) to a numeric value
(typically, zero) “exactly or approximately”.

As $\bar{y} = 0$, then $\mathrm{Var}(y) = (1/t) \sum y_i^2$, minima of which are minima of sum of absolute
adjustments, $\sum |y_i|$. Thus, an improved alternative to (3) for preserving variance
would be to solve (2) for minimum sum absolute adjustments, compute minimum
variance of adjustments, and set $L(y) = -\min \mathrm{Var}(y)/\mathrm{Var}(a)$ exactly or
approximately. This works well if we are interested in preserving mean and variance 
only. Formulation (3) is useful below for preserving correlation and regression slope, 
and consequently may be favored in most applications. 
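
The variance decomposition behind (3) can be checked numerically. The sketch below uses small hypothetical vectors `a` and `y` (not data from [7]) with adjustments that sum to zero, and compares the variance ratio with $1 + 2L(y) + \mathrm{Var}(y)/\mathrm{Var}(a)$; the helper names are illustrative only.

```python
def mean(x):
    return sum(x) / len(x)


def cov(x, y):
    """Population covariance (divisor t), matching the (1/t) convention in the text."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)


def var(x):
    return cov(x, x)


def L(a, y):
    """The linear functional L(y) = Cov(a, y) / Var(a)."""
    return cov(a, y) / var(a)


a = [120.0, 80.0, 45.0, 200.0, 155.0]  # hypothetical original cell values
y = [6.0, -4.0, 3.0, -10.0, 5.0]       # hypothetical adjustments, mean zero

adjusted = [ai + yi for ai, yi in zip(a, y)]
ratio = var(adjusted) / var(a)
decomposition = 1 + 2 * L(a, y) + var(y) / var(a)
```

The two quantities `ratio` and `decomposition` agree up to floating-point error, so driving `L(a, y)` toward zero leaves only the typically small term `var(y) / var(a)`.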

Regarding correlation, the objective is to achieve high positive correlation between
original and adjusted values. We seek Corr(a, a + y) = 1, exactly or approximately.
As $\bar{y} = 0$,

$$\mathrm{Corr}(a, a + y) = \mathrm{Cov}(a, a + y) / \sqrt{\mathrm{Var}(a)\,\mathrm{Var}(a + y)} = (1 + L(y)) / \sqrt{\mathrm{Var}(a + y)/\mathrm{Var}(a)}$$

With Var(y)/Var(a) typically small, the denominator should be close to one, and
min |L(y)| subject to (2) should do well in preserving correlation. Note that
denominator equal to one is equivalent to preserving variance, which as we have seen
also is accomplished via min |L(y)|.

Finally, we seek to preserve ordinary least squares regression $Y = \beta_1 X + \beta_0$ of adjusted
data Y = a + y on original data X = a, viz., we want $\beta_1$ near one and $\beta_0$ near
zero.

$$\beta_1 = \mathrm{Cov}(a + y, a)/\mathrm{Var}(a) = 1 + L(y), \qquad \beta_0 = (\bar{a} + \bar{y}) - \beta_1 \bar{a}$$

As $\bar{y} = 0$, then $\beta_0 = 0$, $\beta_1 = 1$ whenever L(y) = 0 is feasible. This again corresponds
to min |L(y)| subject to the constraints of (2), viz., to (3).

Thus, data quality as regards means, variances, correlation and regression slope can 
be preserved under CTA in the univariate case by the mathematical program (3). Cox 
and Kelly [7] report computational results for a 2-dimensional table of actual data and 
a hypothetical 3-dimensional table that are nearly perfect in preserving all quantities 
of interest. 
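
A small numerical check of the univariate claims, under the assumption that adjustments with zero mean and zero covariance with the original data are feasible. The vectors below are hypothetical and chosen by hand so that both conditions hold exactly; they are not taken from the tables of [7].

```python
from math import sqrt


def mean(x):
    return sum(x) / len(x)


def cov(x, y):
    """Population covariance (divisor t)."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)


a = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical original values
y = [1.0, -2.0, 2.0, -2.0, 1.0]     # mean zero and Cov(a, y) = 0 by construction
adjusted = [ai + yi for ai, yi in zip(a, y)]

slope = cov(adjusted, a) / cov(a, a)             # regression slope, 1 + L(y)
intercept = mean(adjusted) - slope * mean(a)     # regression intercept
correlation = cov(a, adjusted) / sqrt(cov(a, a) * cov(adjusted, adjusted))
```

Here the slope is one and the intercept is zero, while the correlation falls short of one only through the term Var(y)/Var(a), as the derivation above indicates.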



3 Using CTA to Preserve Multivariate Relationships 

In place of a single data set organized in tabular form, viz., Ta = 0, to which adjustments
y are to be made for confidentiality purposes, henceforth we consider multiple
data sets, each organized within a common tabular structure T. This is the typical
situation in official statistics where, for example, tabulations would be shown at vari- 
ous levels of geography and industry classification for a range of variables such as 
total retail sales, cost of goods, number of employees, etc. 

For concreteness, we focus on the bivariate case. Original data are denoted a , b 
and corresponding adjustments to original values are denoted by variables y , z. In the 
univariate case, the key to preserving variance, correlation and regression slope was 
to force Cov (a, y) = 0. Though trivial, it is easy to overlook in the univariate case 
that as Var (a) = Cov (a, a), then preserving variance via Cov (a, y) = 0 is equivalent 
to requiring Cov (a, a + y) = Cov (a, a). In the multivariate situation, however, preserving
covariance (and variance) is of key importance and not to be overlooked.
Namely, if we can preserve mean values and the variance-covariance matrix of origi- 
nal data, then we have preserved essential properties of the original data, particularly 
in the case of linear statistical models. We also would like to preserve simple linear 
regression of original data b on original data a in the adjusted data. These are the 
objectives of Section 3. 



3.1 Preserving Means and Univariate Properties 

This was the subject of the preceding section. Here we need only establish notation. 
Continuing our focus on the bivariate case, we begin with two copies of the mathe- 
matical program (3), one expressed in a and y and the other expressed in b and z. By 
virtue of the preceding section, this is sufficient to preserve both univariate means and 
variances. 



3.2 Preserving the Variance-Covariance Matrix 

The separate copies of model (3) of the preceding section preserve the univariate 
variances Var (a) and Var (b). To preserve Cov (a, b), we require: 

Cov (a, b) = Cov (a + y, b + z) = Cov (a, b) + Cov (a, z) + Cov (b, y) + Cov (y, z)

Consequently, we seek a precise or approximate solution to: 

min I Cov (a, z) + Cov (b, y) + Cov (y, z)| , 

subject to (3) (4) 

The first two terms in this objective function are linear and do not pose a problem, but
the last term is quadratic. We could apply quadratic programming to (4), and for some 
problems this will be acceptable computationally. For general problems and interest, 
we continue to pursue an approximate linear formulation. For practical purposes this 
formulation will express the objective in the constraint system. Our linear approach to 
solving (4) heuristically is: perform successive alternating linear optimizations, viz., 
solve (2) for $y = y_0$, substitute $y_0$ into (4) and solve for $z = z_0$, and continue in this
fashion until an acceptable solution is reached.



3.3 Preserving the Simple Linear Regression Coefficient

Our objective is to preserve the estimated regression coefficient under simple linear 
regression of b on a. We do not address here related issues of preserving the standard 
error of the estimate and goodness-of-fit. We seek exactly or approximately: 

Cov (a, b) / Var (a) = Cov (a + y, b + z) / Var (a + y) 

Var (a + y) / Var (a) = Cov (a + y, b + z) / Cov (a, b) 

= 1 + Cov (a, z) / Cov (a, b) + Cov (b, y) / Cov (a, b) + Cov (y, z) / Cov (a, b) 

Recall: Var(a + y)/Var(a) = 2L(y) + 1 + Var(y)/Var(a)

2L(y) + Var(y)/Var(a) = Cov (a, z) / Cov (a, b) + Cov (b, y) / Cov (a, b) 

+ Cov (y, z) / Cov (a, b) 

To preserve univariate properties, impose the constraint L (y) = 0 exactly or ap- 
proximately. To preserve bivariate covariance, impose Cov (y, z) = 0 exactly or ap- 
proximately. So, if in addition we seek to preserve the regression coefficient, then we 
must satisfy the linear program: 

min |(Cov (a, z) + Cov (b, y)) / Cov (a, b)|, subject to (4) (5) 

In implementation, the objective is represented as a near-zero constraint on the 
absolute value. 



3.4 Preserving Correlations 

The objective here is to ensure that correlations between variables computed on ad- 
justed data are close in value to correlations based on original data, viz., that, exactly 
or approximately Corr (a, b) = Corr (a + y, b + z). After some algebra, preserving 
correlation is equivalent to satisfying, exactly or approximately: 

$$\sqrt{\frac{\mathrm{Var}(a + y)}{\mathrm{Var}(a)}} \, \sqrt{\frac{\mathrm{Var}(b + z)}{\mathrm{Var}(b)}} = \frac{\mathrm{Cov}(a + y, b + z)}{\mathrm{Cov}(a, b)}$$

Methods included in (5) for preserving univariate variances and covariance in many 
cases will preserve correlation. Otherwise, iteration aimed at controlling the right 
hand product may help. 
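
The algebra behind this equivalence is short. A worked version follows, assuming Cov(a, b) and all variances involved are nonzero (conditions the text leaves implicit):

```latex
\begin{align*}
\mathrm{Corr}(a, b) &= \mathrm{Corr}(a + y, b + z) \\
\iff \frac{\mathrm{Cov}(a, b)}{\sqrt{\mathrm{Var}(a)\,\mathrm{Var}(b)}}
  &= \frac{\mathrm{Cov}(a + y, b + z)}{\sqrt{\mathrm{Var}(a + y)\,\mathrm{Var}(b + z)}} \\
\iff \sqrt{\frac{\mathrm{Var}(a + y)}{\mathrm{Var}(a)}}\,
     \sqrt{\frac{\mathrm{Var}(b + z)}{\mathrm{Var}(b)}}
  &= \frac{\mathrm{Cov}(a + y, b + z)}{\mathrm{Cov}(a, b)}
\end{align*}
```

If both univariate variances are preserved, the product of square roots equals one, so correlation is then preserved exactly when covariance is.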



4 Results of Computational Simulations 

We tested the performance of the proposed linear formulations on three 2-dimensional 
tables for both univariate and bivariate statistical measures. Three tables were taken 
from a 4x9x9 3-dimensional table. This table contained actual magnitude data and 
disclosure was defined by a (1 contributor, 70%) dominance rule, viz., a cell is sensitive
if the largest contribution exceeds 70% of the cell value. This results in protection
limits $p_i = v_i^{(1)}/0.7 - v_i$. Tables A, B, C contain 6, 5, 4 sensitive cells respectively.

Upper bounds (capacities) for adjustments to nonsensitive cells were set at 20-percent 
of cell value. 
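
The dominance rule and the resulting protection limit can be sketched as follows. The function names and the contribution values are hypothetical; only the 70% threshold and the formula for the limit come from the text above.

```python
def is_sensitive(contributions, threshold=0.7):
    """(1 contributor, threshold) dominance rule: the largest contribution
    exceeds the given share of the cell value."""
    return max(contributions) > threshold * sum(contributions)


def protection_limit(contributions, threshold=0.7):
    """Smallest increase of the cell value that brings the largest
    contribution's share down to the threshold: v1 / threshold - v."""
    return max(contributions) / threshold - sum(contributions)


cell = [80.0, 15.0, 5.0]           # hypothetical contributions, cell value 100
limit = protection_limit(cell)     # roughly 14.29
share_after = max(cell) / (sum(cell) + limit)
```

After adding the limit to the cell value, the largest contribution accounts for exactly the threshold share, which is why values at least this far from the true value count as safe.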

First, we used the MILP formulation to compute exact solutions for the three in- 
stances AB, AC, BC. The MILP formulation used the mean and variance preserving 
constraints and a covariance change minimization objective, aimed at preserving both 
univariate and bivariate statistics. This kind of modeling has the advantage of building 
flexibility for controlling information loss. For example, if preserving the covariance 
is not important, then covariance preserving constraints would be relaxed to get better 
performance with respect to other statistics. 

We used the ILOG-CPLEX-Concert Technology optimization software to solve 
the resulting MIP and LP problems. The program code was written using C++ and
Concert Technology libraries. Table 1 reports performance on covariance, correlation, 
regression coefficient, ‘i’ vector variance, ‘j’ vector variance (e.g., i = B, j = C). Table 
values are percent change in the original statistics. Means were preserved for all three 
instances. The results are encouraging: information loss (change) to the key statistics 
was low using the linear formulation. This is desirable because using the linear 
formulation to preserve univariate and bivariate statistics simultaneously offers 
advantages on scalability, flexibility, and computational efficiency. The loss on the 
bivariate statistical measures (regression coefficient, correlation) was considerably 
lower because the bivariate constraints offer more flexibility for adjusting cell values. 
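
The measures reported in Table 1 are percent changes of statistics computed before and after adjustment. A minimal sketch of how such figures could be computed is given below; the function names and test vectors are hypothetical, and this is not the authors' C++/Concert Technology code.

```python
from math import sqrt


def mean(x):
    return sum(x) / len(x)


def cov(x, y):
    """Population covariance (divisor t)."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)


def key_statistics(a, b):
    """Statistics of the pair (a, b): regression is of b on a."""
    return {
        "covariance": cov(a, b),
        "correlation": cov(a, b) / sqrt(cov(a, a) * cov(b, b)),
        "regression_coeff": cov(a, b) / cov(a, a),
        "variance_i": cov(a, a),
        "variance_j": cov(b, b),
    }


def information_loss(a, b, y, z):
    """Percent change of each statistic when a, b become a + y, b + z."""
    before = key_statistics(a, b)
    after = key_statistics([ai + yi for ai, yi in zip(a, y)],
                           [bi + zi for bi, zi in zip(b, z)])
    return {name: 100.0 * (after[name] - before[name]) / before[name]
            for name in before}


a = [10.0, 20.0, 30.0, 40.0, 50.0]   # hypothetical table i
b = [12.0, 18.0, 33.0, 41.0, 46.0]   # hypothetical table j
no_change = information_loss(a, b, [0.0] * 5, [0.0] * 5)
doubled = information_loss(a, b, a, [0.0] * 5)  # extreme case: a is doubled
```

The deliberately extreme second call is only a sanity check: doubling `a` quadruples its variance and halves the regression coefficient while leaving the correlation unchanged.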



Table 1. Performance of the linear formulations on key statistics (in percent change).

Cases     Covariance   Correlation   Regression      Variance i   Variance j   Original Correlation
          change       change        Coeff. change   change       change       Coefficient
AB        3.15         1.09          5.94            -3.22        6.2          0.77
AC        1.13         2.63          1.14            -2.43        0.1          0.40
BC        3.6          6.12          6.7             -3.6         -1.89        0.49
Average   2.62         3.28          4.59            -3.08        1.47         0.55


We formulated the problem as a binary integer program, which in general has many 
feasible solutions. It would be interesting to study the behavior of these feasible solu- 
tion vectors on the various statistics to determine the distribution of information loss 
over particular statistics. As an illustration, we evaluated performance of all the feasible
solution vectors for the instance AB. We plotted the performance of these solutions
on covariance and absolute cell deviation as shown in Figures 1-2. Covariance
varied considerably across different feasible solutions. However, the absolute cell
deviation did not. Absolute cell deviation is one of the widely used performance
measures in the statistical disclosure control literature. This indicates that it is possible
to improve performance on the key statistics without hampering performance on cell
deviation, and demonstrates the importance of incorporating “statistical properties
preserving” optimization models in the traditional “minimizing cell deviation”
framework.

Fig. 2. Distribution of solutions w.r.t. absolute cell deviation.

The ordering heuristic as proposed in [3] first sorts the sensitive cells in descend- 
ing order and then assigns the directions for the sensitive cells in an alternating fash- 
ion. It is intended to find good solutions with respect to absolute cell deviation in a 
computationally efficient manner. We studied its performance on preserving statisti- 
cal measures by comparing its performance to that of the exact MILP method and to 
an optimal (nonlinear) solution. By conducting explicit enumeration of the feasible 
vectors it was determined that the nonlinear formulations preserve covariance. 
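
The ordering heuristic described above can be sketched in a few lines. The text does not say which quantity the cells are sorted by or which direction is assigned first; the sketch assumes sorting by cell value and starting with an upward adjustment, and the function name is hypothetical.

```python
def ordering_heuristic(sensitive_values, first_direction=+1):
    """Assign adjustment directions to sensitive cells.

    Cells are sorted in descending order (here: by cell value, an assumption)
    and directions alternate, +1 meaning adjust up and -1 adjust down.
    Returns a dict mapping the original cell index to its direction.
    """
    order = sorted(range(len(sensitive_values)),
                   key=lambda i: sensitive_values[i],
                   reverse=True)
    directions = {}
    direction = first_direction
    for i in order:
        directions[i] = direction
        direction = -direction  # alternate up / down
    return directions


directions = ordering_heuristic([40.0, 95.0, 10.0, 70.0])
```

Once the directions are fixed in this way, the remaining problem in the continuous variables is a linear program.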

We used EXCEL-Solver to solve this nonlinear problem (which in general would 
be computationally impractical). Table 2 reports the comparison between the MILP 
and ordering heuristic methods. Given the variation in the quality of the solutions on 
the covariance and variance measures, the performance of the ordering heuristic was 
good and, in fact, the ordering heuristic improved performance on absolute cell devia- 
tion. This is consistent with [3], which reports superior performance of the ordering 
heuristic on cell deviation.

Table 2. Performance of ordering heuristic (in percent).

Solution Method              Covariance   Correlation   Variance A   Variance B   Absolute Cell
                             change       change        change       change       Deviation
Exact (MIP)                  3.15         1.09          5.94         -3.22        8.81e+7
Ordering Heuristic           5.34         2.19          4.49         -4.56        8.78e+7
Performance w.r.t. optimal   69           100           -24          41           -0.34



5 Concluding Comments 

Developments over the past two decades have resulted in a variety of methods for 
statistical disclosure limitation in tabular data. Among these, controlled tabular ad- 
justment yields the most useable data product, particularly for magnitude data, 
thereby supporting broad opportunities to analyze released data. The question is then 
how well adjusted data preserve analytical outcomes of original data. Cox and Kelly 
[7] addressed this issue in the univariate case. This paper provides effective linear
formulations for preserving key statistics in the multivariate case.

There is an aspect of our formulations worth elaborating, namely that to the extent 
possible quality considerations be incorporated into the constraint system rather than 
the objective function. We find this approach more comprehensible and flexible than 
the traditional approach of defining an information loss measure and optimizing it 
over the tabular and nominal constraints. More importantly, our approach reflects a 
point of view that while mathematical optimization is an essential and flexible tool in 
modeling and solving quality/confidentiality problems, obtaining a mathematically 
optimal solution is of secondary, sometimes negligible, importance. This is because 
there are typically many solutions (adjusted tabulations) that by any statistical stan- 
dard, such as measurement error, are indistinguishable. The issue we claim is then to 
develop a mathematical programming model incorporating essential relationships and 
quality considerations, and to accept any feasible solution. The objective function, 
such as sum of absolute adjustments, is used merely as one among several quality 
controls. Consequently, near-optimal solutions are in most cases acceptable. This can 
be significant in reducing computational burden, especially for national statistical 
offices that deal with large and multiple tabulation systems. 

Related research not reported here looks at combining all constraints to produce a 
single set of multivariate solutions, e.g., ABC, that perform well on key statistical 
measures, at producing formulations for other statistical measures, e.g., for count data, 
the Chi-square statistic, and at investigating computational and scale issues. Space 
limitations preclude presenting here the original tables A, B, C and adjusted pairs AB, 
AC, BC, available from the authors.



Abstract. A heuristic for disclosure control in hierarchical tables was
introduced in [5]. In that heuristic, the complete set of all possible subtables
is protected in a sequential way. In this article, we will show
that it is possible to reduce the set of subtables to a set that contains
only subtables with the same dimension as the complete hierarchical table.
That is, in the case of a three-dimensional hierarchical table, only three-dimensional
subtables need to be checked, not all two- or one-dimensional
subtables.

We applied the old and the new approach to a few test tables. Based
on the results we conclude that neither approach outperforms the other
for all test tables. However, conceptually we prefer the approach with the
reduced set of subtables.



1 Introduction 

One of the subjects of statistical disclosure control (SDC) that has attracted a
lot of attention concerns the protection of tabular data using cell suppression.
Often this concerns tables without any structure in the explanatory variables.
In practice, however, the explanatory variables often exhibit a hierarchical structure.
In case this holds for at least one of these variables, the table is called a
hierarchical table. These kinds of tables are hard to protect using cell suppression
techniques: we have to take into account the many links between the different 
hierarchical levels. Indeed, when dealing with large hierarchical tables, the num- 
ber of relations that define a table grows exponentially. If we want to use the 
mixed integer approach of [2] for cell suppression, this would imply very large 
computation times. Moreover, the size of the problem might become too large 
to be dealt with as such. 

To be able to deal with large hierarchical tables anyhow, a heuristic approach 
was suggested in [5]. This approach is similar to the one described in [1], and 
partitions the large hierarchical table in all possible non-hierarchical subtables, 
induced by the hierarchies of the explanatory variables. The thus obtained set 
of subtables is dealt with in a specific order. This approach was implemented in 
the software package τ-ARGUS. For a description of this package, see [3].

Since the set of (sub-)tables that is considered in that approach contains
all possible non-hierarchical subtables, tables with a lower dimension than the 
hierarchical table are included as well. In the present paper we will show that 
it suffices to consider all possible subtables with the same dimension as the 
complete hierarchical table only. 

In Section 2 we will describe the reduction of the set of tables that need to be 
considered. This reduction has some effects on the way we have to deal with the 
tables in the subset. These adjustments will be discussed in the same section. 
In Section 3 we will present two theorems that show that indeed it suffices to 
consider only subtables with the same dimension as the complete hierarchical 
table. We have implemented the ideas presented in this paper and tested the new 
approach on some hierarchical tables. In Section 4 we will present the results of 
these tests and compare them with the results obtained using the old approach 
with the complete set of subtables. For both approaches we make use of the 
basic routines as implemented in the engine of τ-ARGUS. Finally, in Section 5 we
will summarize the findings of the present paper and formulate some conclusions 
concerning the use of the newly proposed approach. 



2 Reducing the Set of Subtables 

In this section we will describe the approach of [5] and explain the proposed 
reduction of the set of subtables to be considered. To that end, we will make use 
of the following example. 



2.1 A Two-Dimensional Example

An example of a complete hierarchical table that is derived from two explanatory 
variables, is given in Figure 1. This table, including all subtotals, is called the 
basetable. The explanatory variables are defined in Figures 2 and 3. These (fic- 
titious) definitions show that, e.g., the regional variable R has four hierarchical 
levels: from level 0 (the topmost level) down to level 3 (the most detailed level). 
Similarly, the business classification variable BC has three hierarchical levels. 



2.2 Defining the Complete Set of Subtables 

The approach discussed in [5], is a top-down approach: a table T will be pro- 
tected when all tables containing the marginal cells of table T have already been 
protected. 

Obviously, all subtables have to be dealt with in a certain order. The old 
approach starts at the subtable consisting of the crossing of level 0 of all ex- 
planatory variables. I.e., it starts with the grand total of the basetable. Then 
it considers the subtables of the crossing of level 1 of one explanatory variable 
with level 0 of all the other explanatory variables. I.e., effectively it considers 
one-dimensional subtables, whose total is the grand total of the basetable.

            BC     I    LI    MI    SI     A    LA    SA     O
R          300   125    41    44    40    93    51    42    82
P1          77    31     8    12    11    26    13    13    20
P2         128    52    21    18    13    44    24    20    32
C21         77    25    11     9   [5]    31    18    13    21
D211        35    16     4     7   [5]    10     6   [4]   [9]
D212        42     9     7   [2]     -    21    12     9    12
C22         51    27    10   [9]     8    13     6     7    11
P3          95    42    12    14    16    23    14     9    30
C31         45    27     8     9    10   [5]   [2]   [3]    13
C32         50    15     4     5     6    18    12     6    17

Fig. 1. Example of a table defined by R x BC, [bold] means primary unsafe.



Level 0      Level 1           Level 2          Level 3
Region (R)   Province1 (P1)
             Province2 (P2)    County21 (C21)   District211 (D211)
                                                District212 (D212)
                               County22 (C22)
             Province3 (P3)    County31 (C31)
                               County32 (C32)

Fig. 2. Hierarchical structure of regional variable R.



Level 0             Level 1           Level 2
All business (BC)   Industry (I)      Large Industry (LI)
                                      Medium Industry (MI)
                                      Small Industry (SI)
                    Agriculture (A)   Large Agriculture (LA)
                                      Small Agriculture (SA)
                    Other (O)

Fig. 3. Hierarchical structure of business classification variable BC.



This approach was formalised defining groups and classes of subtables: a
crossing of levels $a_1, a_2, \ldots, a_n$ corresponding to n explanatory variables yields
subtables of group $(a_1, a_2, \ldots, a_n)$ in class k with $k = a_1 + a_2 + \cdots + a_n$. Interior
cells of these subtables will correspond to the categories at levels $a_i$, whereas
marginal cells of the same subtables will correspond to categories one level up
in the hierarchy, i.e., at levels $a_i - 1$.

Note that class k might consist of several groups of tables: all groups that
satisfy $\sum_{i=1}^{n} a_i = k$. Moreover, each group might consist of several tables, depending
on the number of parent categories:

$$|(a_1, a_2, \ldots, a_n)| = \prod_{i=1}^{n} P_{a_i},$$

where $|\mathcal{G}|$ denotes the number of tables in group $\mathcal{G}$ and $P_{a_i}$ the number of parent
categories of the categories of variable i at level $a_i$ (i.e., the number of categories
one level above $a_i$ that have one or more subcategories), with the convention
that $P_0 = 1$.

Figure 4 shows all classes and groups that are defined for the example of 
this paper. Note that, e.g., group (3,2) consists of two tables (since $P_{a_1} = 1$ and
$P_{a_2} = 2$), whereas group (1,1) consists of only one table (since $P_{a_1} = P_{a_2} = 1$).
The total number of subtables of this example equals 20. In case one or more
of the $a_i$ in a group equals zero, the subtables of that group will be called
degenerate: essentially the dimension of those tables is lower than the dimension
of the basetable, since $a_i = 0$ means that only the total of explanatory variable
i is used. Moreover, that group will be called degenerate as well, since all tables 
of that group will be degenerate. 
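
The counting of classes, groups and subtables can be reproduced with a short sketch. The parent counts below are read off the example hierarchies of R and BC; the function and variable names are hypothetical and this is not code from τ-ARGUS.

```python
from itertools import product

# P[level] = number of parent categories for the categories at that level,
# i.e. categories one level up that have subcategories (P_0 = 1 by convention).
PARENTS_R = [1, 1, 2, 1]   # level 1: R; level 2: P2 and P3; level 3: C21
PARENTS_BC = [1, 1, 2]     # level 1: BC; level 2: I and A


def count_subtables(parents_per_variable, include_degenerate=True):
    """Return {class k: {group: number of subtables}}."""
    by_class = {}
    level_ranges = [range(len(p)) for p in parents_per_variable]
    for group in product(*level_ranges):
        if not include_degenerate and 0 in group:
            continue  # a level 0 entry makes the group degenerate
        tables = 1
        for level, parents in zip(group, parents_per_variable):
            tables *= parents[level]
        by_class.setdefault(sum(group), {})[group] = tables
    return by_class


def total(by_class):
    return sum(n for groups in by_class.values() for n in groups.values())


full = count_subtables([PARENTS_R, PARENTS_BC])
non_degenerate = count_subtables([PARENTS_R, PARENTS_BC], include_degenerate=False)
print(total(full), total(non_degenerate))  # prints: 20 12
```

For the example, `full` contains six classes with 1, 2, 5, 5, 5 and 2 subtables, and dropping the degenerate groups leaves 12 subtables.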



Class   Groups                  Number of subtables
0       (0,0)                   1
1       (1,0), (0,1)            1 + 1 = 2
2       (2,0), (0,2), (1,1)     2 + 2 + 1 = 5
3       (2,1), (1,2), (3,0)     2 + 2 + 1 = 5
4       (2,2), (3,1)            4 + 1 = 5
5       (3,2)                   2

Fig. 4. Classes, groups and number of subtables defined by R x BC.



2.3 Defining the Set of Subtables 

As we have mentioned before, the original approach deals with all subtables 
defined using the classes and groups of the previous subsection. The approach 
starts with all tables of class 0, then moves its way down one class at a time. 
That way, primary unsafe marginal cells of tables in class i have been protected 
as primary unsafe interior cells of tables in classes i - 1 and i - 2 for i ≥ 2.
Obviously, tables in class 1 only have marginals that occur in tables of class 0
and tables in class 0 do not exhibit marginal cells as such. Note that all tables
in class i can be protected independently of each other, given the results from
the previous classes.

Using these observations, the original approach was to keep the status of
marginal cells fixed when going from class i down to class i + 1. Hence the idea
was that, save for the inherited status of the marginal cells, each class could be
protected independently of previous classes. Essential to this approach was that
degenerate tables were protected as well.

In some cases, however, the problem of protecting the interior cells while
keeping the marginals fixed, can be infeasible. Especially when a table consists 
of several empty cells (i.e., is sparse) or nearly empty cells along with some cells 
with large values (i.e., is skewed), this can happen. In these cases, it is necessary 
to add one or more secondary suppressions to the marginals. Suppose that this 
table belonged to class i. Since the marginal cells of that table were interior cells 
of tables in class i — 1 (and possibly i — 2 when the grand total of the subtable 
is involved), those tables of class i — 1 need to be reconsidered. This is called 
backtracking. 

In ‘real-life’ applications, the sparseness or the skewness of tables is very
much apparent in hierarchical tables with high detail, which suggests that
backtracking will often be needed. Thus it no longer seems necessary to protect
degenerate tables on their own.

In the new approach, we will not consider degenerate tables on their own, 
but only as marginals of higher dimensional tables. If one still wants to restrict 
secondary suppressions to interior cells as much as possible (e.g., to try to keep 
the number of backtrackings limited), the cost for suppressing marginal cells 
could be kept artificially high. That way, secondary suppressions in the marginals 
needed to protect primary unsafe marginal cells would still be found, whereas 
primary unsafe interior cells would be protected suppressing other interior cells 
whenever possible. 

Not considering degenerate tables on their own would imply that we have to
protect significantly fewer subtables. E.g., the classes and groups to be considered
in our example in this new approach are given in Figure 5 and the number of 
subtables is reduced from 20 to 12. In the next section we will show that indeed 
it suffices to consider subtables with the same dimension as the basetable only. 
Moreover, we will show in which cases it will be necessary to backtrack because 
of secondary suppression of marginal cells. 



Class   Groups          Number of subtables
2       (1,1)           1
3       (2,1), (1,2)    2 + 2 = 4
4       (2,2), (3,1)    4 + 1 = 5
5       (3,2)           2

Fig. 5. Reduced classes, groups and number of subtables, defined by R x BC.



3 Theoretical Background 

In the first theorem we will show that it suffices to consider only subtables of 
the same dimension n as the basetable. 

Theorem 1. Consider $n$ hierarchical explanatory variables with at least two 
levels each (level 0 and level 1). Let $G_d = (a_1, a_2, \ldots, a_n)$ be a degenerate group 
of subtables, defined by these variables. Denote the class this group belongs to by 
$k$ and the number of explanatory variables at level 0 (i.e., the number of $a_i$ equal 
to 0) by $m$. Then all tables of $G_d$ will be protected while protecting the subtables 
of a specific non-degenerate group of class $k + m$. 

Proof: Define a new group of tables $(b_1, \ldots, b_n)$, where 

$$b_j = \begin{cases} a_j & \text{if } a_j > 0, \\ 1 & \text{if } a_j = 0. \end{cases}$$

Note that $b_j > 0$ for all $j = 1, \ldots, n$ and 
$\sum_{j=1}^{n} b_j = \sum_{j=1}^{n} a_j + m = k + m$. I.e., 
this group belongs to class $k + m$ and is non-degenerate. Since by construction 
the tables of group $G_d$ are marginals of the tables of group $(b_1, \ldots, b_n)$, they will 
be protected automatically while protecting the tables of group $(b_1, \ldots, b_n)$. □ 
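
The construction in the proof is easy to check numerically. The following sketch 
(the function name `lift_degenerate_group` is hypothetical) replaces every level 0 
by level 1 and confirms that the class rises from k to k + m. 

```python
def lift_degenerate_group(group):
    """Map a degenerate group (a_1, ..., a_n) to the group (b_1, ..., b_n).

    Every variable at level 0 is moved to level 1, all others are unchanged.
    Returns the lifted group together with its class k + m.
    """
    k = sum(group)        # class of the degenerate group
    m = group.count(0)    # number of variables at level 0
    lifted = tuple(a if a > 0 else 1 for a in group)
    assert sum(lifted) == k + m
    return lifted, k + m


lifted, lifted_class = lift_degenerate_group((0, 2, 0))
print(lifted, lifted_class)
```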

When only non-degenerate subtables are considered, secondary suppressions that 
appear in marginal cells will not always induce backtracking: in some cases the 
current subtable is the first to include those suppressed marginal cells. In other 
cases those suppressed marginal cells are part of previously protected tables. Only 
in the latter cases do we have to backtrack. 

The next theorem formalises in which cases backtracking is needed. 

Theorem 2. Let T be a non-degenerate table of group $(a_1, \ldots, a_n)$. Assume that 
a marginal cell C of T is needed as secondary suppression. Then, backtracking 
is needed if and only if C corresponds to a subtotal of at least one variable $i$ with 
$a_i > 1$. 

Proof: First we will show the if-part, i.e., if C corresponds to a subtotal of at 
least one variable $i$ with $a_i > 1$, backtracking is needed. 

Denote the class to which the group under consideration belongs by $k$, 
hence $k = \sum_{i=1}^{n} a_i$. Without loss of generality, assume that C corresponds to 
a subtotal of variable 1 with $a_1 > 1$. Then C is a cell at level $a_1 - 1$ of that 
variable. I.e., C is an interior cell of a subtable of group $(a_1 - 1, a_2, a_3, \ldots, a_n)$. 
That subtable is non-degenerate and belongs to class $k - 1$. That table has been 
considered before and hence backtracking is needed. 

Next, we will show the only-if-part, i.e., if C corresponds to subtotals of 
variables $i$ with $a_i = 1$ only, backtracking is not needed. I.e., we will show that 
in that case C does not belong to any other non-degenerate subtable in any of 
the classes 0 up to and including $k$. 

Without loss of generality, assume that C corresponds to subtotals of variables 
1 up to $i_0$, with $a_1 = a_2 = \cdots = a_{i_0} = 1$. Forcing C to be an interior cell 
of a subtable by decreasing the level of at least one of these variables would 
result in a degenerate table. Hence in that case backtracking is not needed. If 
on the other hand the level of one of the other variables is decreased, variable 
$j > i_0$ say with level $a_j$, the cells in the resulting subtables would correspond to 
at most level $a_j - 1$. Since C corresponds to level $a_j$, it does not occur in that 
lower class and again backtracking is not needed. 

Table 1. Information on the structure of the test tables. 

Table    Variables    Levels           Categories 
A3       X, Y, Z      0-5, 0-4, 0-2    55, 19, 17 
A2       Y, Z         0-4, 0-2         19, 17 
B3       X, Y, Z      0-3, 0-1, 0-4    70, 11, 595 
B2       Y, Z         0-1, 0-4         11, 595 
C3       X, Y, Z      0-6, 0-3, 0-5    137, 16, 713 
C2       X, Z         0-6, 0-5         137, 713 

Moreover, class $k$ itself does not contain any other non-degenerate table in 
which C occurs as a marginal cell either. Indeed, any group of class $k$ that 
contains C as a marginal cell with the first $i_0$ variables at level 1 would have to 
be of the form $(1, \ldots, 1, b_1, \ldots, b_{i_1})$, with $i_1 = n - i_0$ and 
$\sum_{i=1}^{i_1} b_i = k - i_0$. Since C corresponds to level $a_j$ of variable $j$ with 
$j > i_0$, the corresponding $b_i$ with $i = j - i_0$ can only be equal to $a_j$ or to 
$a_j + 1$. The restriction $\sum_{i=1}^{i_1} b_i = k - i_0 = \sum_{j=i_0+1}^{n} a_j$ then 
implies that these $b_i$ can only be equal to the corresponding $a_j$ itself. So any 
other table in class $k$ containing C as a marginal cell would belong to the same 
group. However, these tables are obtained from crossings of the variables on the 
given levels with different subcategories, and thus cannot contain cell C. □ 
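
Theorem 2 reduces the backtracking decision to a simple test on the levels of the 
current group. A minimal sketch, assuming 0-based variable indices and a 
hypothetical function name `backtracking_needed`: 

```python
def backtracking_needed(group, subtotal_variables):
    """Decide whether a suppressed marginal cell C forces backtracking.

    group              -- levels (a_1, ..., a_n) of the non-degenerate table T
    subtotal_variables -- 0-based indices of the variables for which C is a
                          subtotal within T
    Backtracking is needed iff C is a subtotal of a variable with level > 1.
    """
    if any(a < 1 for a in group):
        raise ValueError("group must be non-degenerate (all levels >= 1)")
    return any(group[i] > 1 for i in subtotal_variables)


print(backtracking_needed((3, 2), [0]))     # subtotal of a variable at level 3
print(backtracking_needed((1, 2), [0]))     # subtotal of a variable at level 1
```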

4 Some Test Results 

In order to show the differences between the new and the old approach, we have 
implemented both using Microsoft Visual C++, and applied them to some tables. 
Moreover, we have made use of routines provided by J.J. Salazar-Gonzalez to 
protect each subtable using the mixed integer approach as described in [2]. For 
these routines, either CPlex or Xpress is needed (commercial LP-solvers). The 
tests have been performed on a machine with a Pentium 4 processor (1.5 GHz, 
256 MB), with Windows 2000 Professional and CPlex version 7. 

4.1 The Test Tables 

In this subsection we will describe the tables that have been used in the tests. 
In order to test the behaviour of both approaches on realistic tables, we used 
real CBS data. However, to avoid any possible disclosure in this paper, we will 
describe each table in general terms, i.e., specifying neither the exact cell values 
nor the explanatory variables. In Table 1 the name of each test table is given, 
along with the (pseudo) names of the explanatory variables, their levels and 
the total number of categories (including all subtotals). Three different data sets 
were used to produce the test tables. Each data set provided two tables: one 
two-dimensional table and one three-dimensional table. The two-dimensional tables 
are in fact subtables of the three-dimensional tables. For each table the primary 
unsafe cells have been detected according to a certain linear sensitivity measure. 
The safety ranges required for these primary unsafe cells were calculated in 
accordance with the applied sensitivity measure. In Table 2 the number of primary 
unsafe cells, the number of empty cells and the number of safe cells is given for 
each test table. As described in this paper, the new and the old approach differ in 
the number of classes, groups and subtables that will be considered by τ-ARGUS. 

Table 2. Information on the number of cells of the test tables. 

Table    Total        Primary            Empty                Safe 
A3       17 765       3 789 (21.3%)      9 974 (56.2%)        4 002 (22.5%) 
A2       323          6 (1.9%)           68 (21.0%)           249 (77.1%) 
B3       457 380      80 407 (17.6%)     296 260 (64.8%)      80 713 (17.6%) 
B2       6 534        652 (10.0%)        653 (10.0%)          5 229 (80.0%) 
C3       1 562 896    309 412 (19.8%)    1 047 410 (67.0%)    206 074 (13.2%) 
C2       97 681       32 508 (33.3%)     23 955 (24.5%)       41 218 (42.2%) 

Table 3. Information on the classes and groups in the old and the new approach. 

         Classes        Groups         (Sub)tables 
Table    old    new     old    new     old       new 
A3       12     9       90     40      960       675 
A2       7      5       15     8       60        45 
B3       9      6       40     12      1 568     715 
B2       6      4       10     4       112       55 
C3       15     12      168    90      56 672    47 250 
C2       12     10      42     30      8 096     7 875 

In Table 3 the number of classes, groups and (sub)tables as defined in the 
old approach as well as in the new approach is given. In both approaches, the 
subtables are protected using the optimisation routines as described in [2]. These 
routines minimize the total cost of all suppressed cells, given the restriction that 
the prescribed safety ranges are met for each primary unsafe cell. Since the choice 
of the cost of suppressing cells is arbitrary, we used two different cost functions: 
one that minimizes the total number of suppressions (tables A2, A3, C2 and C3) 
and another one that minimizes the total cell value of the suppressed cells (tables 
B2 and B3). In both situations, the costs of marginal cells of each subtable were 
calculated in such a way that suppressing a marginal cell would cost more than 
suppressing all the interior cells it corresponds to. 
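
One simple way to meet this requirement is sketched below. The exact rule used in 
the tests is not stated here, so the fixed positive margin and the function name 
`marginal_cell_cost` are assumptions made purely for illustration. 

```python
def marginal_cell_cost(interior_costs, margin=1):
    """Return a cost strictly above the combined cost of the interior cells.

    interior_costs -- costs of the interior cells that add up to the marginal
    margin         -- assumed positive surcharge (not taken from the paper)
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    return sum(interior_costs) + margin


# Unit costs (minimising the number of suppressions): three interior cells.
cost_by_count = marginal_cell_cost([1, 1, 1])
# Value costs (minimising suppressed cell value): interior values 40, 25, 10.
cost_by_value = marginal_cell_cost([40, 25, 10])
print(cost_by_count, cost_by_value)
```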

4.2 The Results 

The time needed to find protection patterns for the test tables is shown in 
Table 4. This table shows the total CPU time as well as the CPU time spent 
inside the optimisation routines of [2]. 

Table 4. Total CPU time and time spent in optimisation routines in seconds. 

         Total                 Optimisation 
Table    old        new        old        new 
A3       612        662        562        627 
A2       0.52       0.59       0.27       0.25 
B3       338 228    373 919    337 484    373 244 
B2       65         63         64         61 
C3       28 023     26 865     18 134     20 405 
C2       577        549        400        400 

By reducing the number of subtables to be considered, one would expect 
the total CPU time to diminish. However, as Table 4 shows, this is not always 
the case. In fact, often the new approach spent more time in the optimisation 
routines than the old approach. Since the old approach deals with the marginals 
of the tables of the new approach separately, this will sometimes simplify the 
protection of higher dimensional tables. Hence, the same table can be harder to 
protect in the new approach than in the old approach. Table 5 shows the effect 
of both approaches on the number of times backtracking was needed. In most 
cases, the new approach shows a lower number of backtrackings. In the same 
table, the number of secondary suppressions found by both approaches is given. 
The sum of suppressed cell values (excluding the primary suppressions) relative 
to the total of all cell values is given as well. Note that, since the basetable 
contains several subtotals due to the hierarchies of the spanning variables, the 
total cell value is (much) larger than the cell value of the grand total. In the 
rightmost column of this table, that total cell value is given for reference. In 
almost all cases, the number of secondary suppressions and the sum of the 
suppressed cell values with the new approach are lower than or equal to those 
found with the old approach. According to Table 5, the only exceptions are table 
C2 for the number of secondary cells, and table B3 for the suppressed cell value. 
Note that in both cases the results are in contrast to the cost functions used. 
However, both approaches are such that each subtable is protected with the 
minimum cost of suppressed cells. Consequently, neither approach will necessarily 
find the optimal suppression pattern for the complete hierarchical table. Another 
choice for a pattern in a specific subtable will influence the patterns for subtables 
on lower levels. A "better" pattern for one subtable may lead to an ultimate pattern 
for the complete table which is actually worse. Moreover, a specific problem as 
described in [4] might still be present, since both approaches deal with classes of 
non-hierarchical subtables only. Another consequence of both approaches is that 
secondary suppressions need to be passed on to lower level tables as primary 
suppressions with certain safety ranges. It is not at all trivial (if not impossible) 
to calculate the needed safety ranges exactly, hence a certain choice was made. 
This might influence the attained suppression pattern and hence the realised 
safety ranges of the ‘real’ primary cells as well. 

Table 5. Number of backtrackings, secondary suppressions, and percentage of 
secondary suppressed cell value. 

         Backtrackings    Secondary cells       Cell value (%)    Total cell value 
Table    old    new       old        new        old      new      (x 10^?) 
A3       42     32        2 442      2 405      61.06    53.43    6 231 591 
A2       1      1         16         16         1.72     1.72     1 265 970 
B3       31     32        23 457     22 859     3.40     3.42     173 294 
B2       0      0         216        216        0.11     0.11     63 374 
C3       157    123       113 787    111 942    25.82    24.23    7 703 030 
C2       20     21        13 353     13 362     7.07     6.98     1 954 498 

To assess the depth of the secondary suppressions within the hierarchical 
structure, we assigned certain weights to each of the cells of the hierarchical 
table. These weights represent the hierarchical levels of each cell: a low weight 
corresponds to a cell high up in the hierarchical table (e.g., the cell with the 
grand total) and a high weight corresponds to a cell deep down in the hierarchical 
table (e.g., a cell at the lowest level of each of the spanning variables). This was 
attained by assigning the following weight to a cell at levels $(a_1, \ldots, a_n)$ of an 
$n$-dimensional table (i.e., an interior cell in a subtable in group $(a_1, \ldots, a_n)$): 

$$w = \frac{1}{n} \sum_{j=1}^{n} \frac{a_j}{L_j},$$

where $L_j$ is the maximum level of variable $j$. Note that, since $a_j \in \{0, \ldots, L_j\}$, 
the weight $w$ can attain values between 0 and 1. Indeed, the grand total will 
receive a weight 0, whereas a cell at the lowest level of all hierarchies will obtain 
a weight of 1. 

As an example, consider a 2-dimensional hierarchical table with $L_1 = 5$ and 
$L_2 = 9$. Then a cell at level (2,4) would obtain a weight of 19/45 whereas a cell 
at level (4,2) would obtain a weight of 23/45. Hence, the latter cell is in the 
lower half of the table and the former one in the upper half of the table. 
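
The two weights in this example can be verified with exact arithmetic. The sketch 
below takes the weight to be the average over the variables of level divided by 
maximum level, which reproduces both quoted values; the function name 
`hierarchy_weight` is hypothetical. 

```python
from fractions import Fraction


def hierarchy_weight(levels, max_levels):
    """Average relative depth of a cell: (1/n) * sum(a_j / L_j)."""
    n = len(levels)
    return sum(Fraction(a, top) for a, top in zip(levels, max_levels)) / n


upper = hierarchy_weight((2, 4), (5, 9))
lower = hierarchy_weight((4, 2), (5, 9))
print(upper, lower)
```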

Once the weights are assigned to each of the cells, we can calculate the 
percentage of secondary suppressions with a weight less than or equal to a certain 
threshold. This represents the portion of the secondary suppressions in the higher 
parts of the hierarchies. Often users want this to be small. In Table 6 the 
percentages are given for several values of the threshold. The results show that 
most of the secondary suppressions lie in the lower parts of the tables. In all 
cases, the percentage of the suppressions in a specific upper part of the tables 
with the new approach is lower than or equal to that with the old approach. 
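
The percentage reported for each threshold is a plain share of the suppressed 
cells. A minimal sketch with made-up weights (the function name 
`share_at_or_below` and the sample values are illustrative only): 

```python
def share_at_or_below(weights, threshold):
    """Percentage of secondary suppressions whose weight is <= threshold."""
    if not weights:
        return 0.0
    return 100 * sum(w <= threshold for w in weights) / len(weights)


# Hypothetical weights of four secondary suppressions.
sample_weights = [0.1, 0.3, 0.6, 0.9]
print(share_at_or_below(sample_weights, 0.50))
```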



5 Summary and Conclusions 

In the software package τ-ARGUS, a heuristic approach for the protection of 
hierarchical tables is used, which is described in [5]. In this paper, we presented an 
adaptation to this approach which substantially reduces the number of subtables 
to be considered by τ-ARGUS. We showed that not all non-hierarchical subtables 
need to be considered, but that it is sufficient to restrict to subtables with the 
same dimension as the complete hierarchical table. To test the performance of 
the new approach, both the old and the new approach have been applied to some 
test tables. 

Table 6. Percentage of secondary suppressions with weight less than or equal to the 
threshold. 

         Threshold 0.25    Threshold 0.50    Threshold 0.75 
Table    old     new       old      new      old      new 
A3       1.72    1.29      19.78    17.96    73.87    71.77 
A2       0       0         0        0        25.00    25.00 
B3       0.00    0.00      4.18     4.05     25.81    25.40 
B2       0       0         0        0        0        0 
C3       0.08    0.07      15.59    15.24    74.86    74.37 
C2       0       0         1.60     1.54     52.89    52.83 

The results show no great differences between the approaches, but for some 
criteria one approach seems to be slightly better than the other, while for other 
criteria, it is the other way around. Concerning the CPU time needed, the old 
approach seems somewhat better. However, the resulting suppression patterns 
for the complete tables found by the new approach appear to be slightly superior 
to the patterns found by the old approach. This is an argument to use the new 
approach in the future. Another, more intuitive, argument is the following. The 
only reason to invent the heuristic for hierarchical tables was the impossibility of 
using the suppression algorithm of [2]. This algorithm cannot be used in practice 
because of the huge number of relations needed to define a hierarchical table. 
This leads to very large computing times or problems that become too large 
to be dealt with. The heuristic deals with the hierarchical situation by dividing 
the complete table in subtables. However, we would still like to protect parts of 
the table as large as possible. In the old heuristic approach, subtables with a 
lower dimension than the complete table are considered for protection. This is 
no longer needed in the new approach, so in effect we will be protecting larger 
parts of the table. Therefore, we recommend the new approach. 



Abstract. Statistical agencies make use of waivers to increase the amount of 
information that they release in tables of business data. Waivers, obtained from 
large contributors to sensitive cells, allow the agency to publish the cell value 
even though doing so may lead to the disclosure of these large contributors’ 
values. Waivers allow sensitive cells to be released and also reduce the need to 
suppress nonsensitive cells to prevent the derivation of these sensitive cells’ 
values. As waivers are not easy to obtain it is desirable to identify businesses 
whose waivers will have the largest impact on the amount of publishable infor- 
mation. Two heuristic approaches are presented for the identification of priority 
cells for waivers that factor in properties of neighboring cells. Empirical inves- 
tigation on two business surveys shows that the approaches have merit but that 
the structure of the tabular data plays an important role in the results obtained. 



1 Introduction 

The suppression of information (cells) in published tabulated survey data is a tech- 
nique commonly used to preserve the confidentiality of respondents’ data. In business 
surveys, the primary cell suppression patterns are usually driven by two criteria: a 
minimum number of respondents per cell and dominance rules where, for example, a 
cell could be suppressed if a business accounts for more than 80 % of the cell total. 

Due to the highly skewed nature of business populations, the suppression of cells 
frequently stems from the presence of one or a few businesses that dominate a sector. 
The suppression of a sensitive cell will likely trigger the suppression of other cells in 
order to reduce the risk of residual disclosure. In an effort to publish more cells, waiv- 
ers are sometimes obtained from large contributors that allow the publication of the 
cell’s value despite the risk of disclosure of their data. These businesses are often 
selected in an ad hoc manner. 

The decision to obtain a waiver from a respondent for one cell rather than another 
may lead to different suppression patterns, some of which meet user requirements 
better than others. 

In this paper, we try two approaches for identifying potential waivers. The first is 
based on a mixed score function that assigns scores to cells and to the blocks to which 
they belong (blocks are multidimensional groups of cells adding up to a common 
subtotal cell). The scores are based on factors such as the size of the cell or block, the 
amount of suppressed information and the number of waivers required. Priority for 
waivers is given to higher scoring blocks and cells. The second approach is based on 
attributing to each sensitive cell a score that is an indication of the minimum amount 
of suppression required to protect it along each of its axes in the table (i.e., along 
one-dimensional segments leading up to a subtotal). 

When compared to a strategy targeting the largest sensitive cells for waivers, the 
mixed score approach gave good results, in terms of total value releasable after waiv- 
ers, when applied to data from the 1997 Annual Survey of Manufacturing (ASM). 
When adapted to the 2000 Annual Wholesale Trade Survey (WTS), the method gave 
mixed results. This led to the development of the second approach, called the segment 
cost approach. When tried with the WTS this approach resulted in a higher releasable 
total value, for the same number of waivers. It did, however, accomplish this at the 
cost of more suppressed cells. 

The paper starts with concepts and definitions related to confidentiality. Section 3 
presents some features of CONFID, Statistics Canada’s cell suppression program. The 
mixed score approach is described in section 4 and evaluated using the ASM in sec- 
tion 5. Section 6 discusses the adaptation of this approach to the WTS. The segment 
cost approach and results are presented in section 7. Section 8 gives concluding re- 
marks and points to aspects that require further development. 



2 Concepts and Definitions 

2.1 Waivers and Sensitivity 

A waiver is a written agreement between a statistical agency and a respondent where 
the respondent gives consent to the release of their individual respondent information 
[2]. A respondent may be a person, organization or business. The term "waiver" is 
used since the respondent is waiving the right to the usual confidentiality protection. 
Under this agreement, the respondent releases the statistical agency from liability 
regarding its obligation to preserve the confidentiality of the respondent’s data. 

Waivers permit Statistics Canada to release data, for example in publications or 
off-the-shelf tabulations, where those data could not otherwise be released for reasons 
of confidentiality (e.g., where table counts contain one or two businesses or where 
one or two businesses dominate). Waivers can also be used to allow the release of 
individually-identified respondent data. 

Sensitivity is used to describe and quantify the risk of potential disclosure within a 
cell. A cell is deemed sensitive if the risk of disclosure is considered too high. The 
(n,k) rule for identifying sensitive cells can also be expressed as a sensitivity function. 
For example, with the (3,75) rule the sensitivity function s(x) can be formulated as 
follows: 

$$s(x) = x_1 + x_2 + x_3 - \frac{75}{100}\, t(x), \tag{1}$$

where t(x) is the cell’s total value, and $x_1$, $x_2$, $x_3$ are the values of the three 
largest contributors. A cell would be qualified as sensitive if s(x) > 0, which in this 
example would occur if the largest three respondents accounted for more than 75% 
of the cell total value. The value taken by s(x) along with other criteria can also be 
used to determine the cell sensitivity. More details on this concept can be found in [3]. 

2.2 Cells, Blocks and Complementary Suppressions 

The cell and the block are the first two levels in a published table. A cell corresponds 
to the most detailed element that is being published. In business surveys, a cell is 
usually defined using two or three characteristics - the cell's dimensions. Typically, it 
will include a detailed industrial dimension like a subset of the four-digit Standard 
Industrial Classification (SIC) codes and a geographical dimension, for example, a 
province. A block is an aggregation of cells (e.g., four-digit SIC by province) that 
includes totals (or sub-totals) for the first dimension categories, totals for the second 
dimension categories and the grand total. 

Cells identified as sensitive are suppressed. The presence of a sensitive cell in a ta- 
ble may require the suppression of one or more other cells in order to prevent the 
value of the sensitive cell from being derived. The choice of these complementary 
cells is typically determined by a program using linear programming techniques and 
can depend on various parameters including a cost function defined for nonsensitive 
cells which one attempts to minimize. Examples of cost functions are to assign a 
constant cost to each suppressed cell (to minimize the number of complementary cells), 
to assign a cost equal to the suppressed cell's value (to minimize the total value 
suppressed), and to assign a cost equal to log(1+value)/(1+value) for suppressed cells 
(called information cost). The exact strategy will depend on user objectives for the 
survey and the publications [5]. 
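
The three cost functions mentioned above can be written down directly. This is a 
sketch with hypothetical function names; the natural logarithm is assumed for the 
information cost, since the base is not specified. 

```python
import math


def constant_cost(value):
    """Every suppressed cell costs the same: minimises the number of cells."""
    return 1.0


def value_cost(value):
    """Cost equals the cell value: minimises the total value suppressed."""
    return float(value)


def information_cost(value):
    """Information cost log(1 + value) / (1 + value), natural log assumed."""
    return math.log(1 + value) / (1 + value)


def pattern_cost(suppressed_values, cost):
    """Total cost of a set of complementary suppressions under one cost function."""
    return sum(cost(v) for v in suppressed_values)


complements = [120, 30, 5]
print(pattern_cost(complements, constant_cost))
print(pattern_cost(complements, value_cost))
print(pattern_cost(complements, information_cost))
```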



3 CONFID Suppression Program 

CONFID is a Statistics Canada program that identifies sensitive cells according to 
users’ confidentiality rules [6]. It also identifies all the complementary cells that 
should be suppressed in order to produce the tabulations. For a given problem, i.e., a 
set of tabulations to produce, there are many possible solutions. Those solutions are 
called confidentiality patterns. CONFID finds a near optimal solution according to 
some pre-specified criteria, including minimizing a cost function such as described 
above. Some surveys use a two-pass approach in identifying a suppression pattern. In 
the first pass, complementary suppressions are identified using a value cost function. 
A second pass is done using the information cost function but involving only sup- 
pressed cells from the first pass (i.e., only sensitive cells and cells used as comple- 
ments the first time can be used to protect sensitive cells). In this second pass, some 
of the complementary cells from the first pass become publishable. CONFID has 
other features such as the treatment of common respondents between cells and the 
joint protection of groups of cells. For more details see [3]. 

The confidentiality patterns produced may lead to the suppression of a consider- 
able amount of information for surveys that produce very detailed tabulations com- 
posed of many cells with few units. In these particular cases, the use of waivers can 
help to reduce the amount of suppressed information. If the respondent or respondents 
responsible for the sensitivity of a cell all provide waivers then that cell and other 
complementary cells that are suppressed to protect its value may be published, leading 
to substantial gains in information. 



4 The Mixed Score Approach 

It is not practically feasible to obtain waivers for all units in sensitive cells because of 
cost and operational constraints - obtaining waivers may require contacting an enter- 
prise a number of times. Careful selection of the cells that contain potential waivers is 
necessary. The mixed score approach identifies systematically which cells and eligi- 
ble waivers should be prioritized. The goal is to find the waivers within specific cells 
that will yield an optimal suppression pattern by allowing the publication of more 
information. To achieve this we aimed for sensitive units that are responsible for the 
largest amount of suppressed information in terms of the sensitive cells themselves 
and their complements. To be able to rank potential waiver requests, we sought to 
establish some criteria. We based these criteria on the characteristics of the cell and of 
the block containing the sensitive unit. 

4.1 Cell Criteria 

We identified two factors to determine the cell score: the number of waivers required 
and the cell size. For each criterion a score is obtained and transformed into a 
percentage. These percentages are combined to get a unique cell score where both 
criteria are weighted according to their relative importance. 

Number of Waivers. Since obtaining waivers from respondents is time consuming, 
we favour cells that require a smaller number of waivers. This criterion is calculated 
using the ratio of the number of waivers required to make a cell publishable over the 
maximum number of potential waivers required among all the sensitive cells. The 
criterion yields a higher score if there is a smaller number of waivers. It is defined by: 

$$ce_1 = 100\, (\max(x) - x) / \max(x),$$

where x is the number of waivers that have to be requested to remove the cell’s sensi- 
tivity and max(x) is the largest possible value of x across all the sensitive cells. 

The number of waivers required is calculated sequentially. For example, with a 
(1,80) rule, if the largest respondent accounts for more than 80% of the total value, the 
second largest respondent accounts for over 80% of the total excluding the largest 
respondent, and the third largest respondent accounts for over 80% of the remaining 
total (after removing the largest two respondents), then it is determined that three 
waivers are required. 
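
The sequential count can be sketched as a loop that removes the dominant respondent 
until the (1,k) rule is satisfied. This is an illustration under the stated rule 
only, not CONFID's implementation; the function name `waivers_required` is 
hypothetical. Note that a lone remaining respondent always accounts for 100% of the 
remaining total, so a chain of dominance can run through every respondent in the cell. 

```python
def waivers_required(contributions, k=80):
    """Count waivers needed under a (1, k) rule applied sequentially.

    While the largest remaining respondent holds more than k% of the
    remaining total, that respondent needs a waiver and is set aside.
    """
    remaining = sorted(contributions, reverse=True)
    total = sum(remaining)
    waivers = 0
    while remaining and remaining[0] > k / 100 * total:
        total -= remaining[0]
        remaining.pop(0)
        waivers += 1
    return waivers


print(waivers_required([90, 5, 5]))        # only the largest respondent dominates
print(waivers_required([900, 90, 5, 5]))   # dominance persists after one removal
print(waivers_required([40, 30, 30]))      # no respondent dominates
```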

Cell Size. We wish to publish cells that are larger contributors for the main variable 
of interest. We define our criterion as: 

$$ce_2 = 100\, y / \max(y),$$

where y is the sensitive cell’s total for the variable of interest and max(y) is the largest 
possible value of y across all the sensitive cells. The $ce_2$ value will be proportional to 
the y value. 

Using these two cell criteria we define a function to score and rank each cell as 
follows: 

$$f(ce) = w_1\, ce_1 + w_2\, ce_2,$$

where the $w_i$ are the weights applied in light of the criterion’s deemed importance 
($w_1 + w_2 = 1$). Both weights were set to one-half after consultation with subject matter 
specialists. The f(ce) score is expressed as a percentage. 
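
A small worked example of the cell score, using made-up cells: the data and the 
function name `cell_scores` are illustrative assumptions, and every sensitive cell 
is assumed to need at least one waiver so that max(x) is positive. 

```python
def cell_scores(cells, w1=0.5, w2=0.5):
    """Score sensitive cells by f(ce) = w1 * ce_1 + w2 * ce_2.

    cells maps a cell name to (waivers_needed, cell_total).
    ce_1 rewards cells needing few waivers, ce_2 rewards large cells.
    """
    max_x = max(x for x, _ in cells.values())
    max_y = max(y for _, y in cells.values())
    scores = {}
    for name, (x, y) in cells.items():
        ce1 = 100 * (max_x - x) / max_x
        ce2 = 100 * y / max_y
        scores[name] = w1 * ce1 + w2 * ce2
    return scores


# Hypothetical sensitive cells: (number of waivers, total for main variable).
scores = cell_scores({"A": (1, 200), "B": (4, 1000), "C": (2, 500)})
print(scores)
```

With equal weights, a small cell needing one waiver (A) scores slightly below a 
large cell needing four (B), which shows how the two criteria trade off. 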



4.2 Criteria for a Block 

We define four criteria at the block level in order to identify the greatest gains when 
using waivers: the number of waivers required before a block becomes fully publish- 
able, the size of the block based on the main variable of interest, the relative impor- 
tance of the suppressed confidential and complementary information, and the number 
of complementary cells suppressed in the block. 

Each criterion is derived and expressed as a percentage. The weighted contribu- 
tions to each percentage are combined to obtain a unique score. 

Number of Waivers. We favour blocks that require a smaller number of waivers. We 
calculate the ratio of the number of waivers required to make sensitive cells inside the 
block publishable over the maximum number of potential waivers required among all 
the blocks where we observe sensitivity. It is defined by: 

$$bl_1 = 100\, (\max(x) - x) / \max(x),$$

where x is the total number of waivers that have to be requested to remove the 
sensitivity of cells within the block and max(x) is the largest possible value of x across 
all the blocks. The $bl_1$ value will be larger when a smaller number of waivers is required. 

Block Size. We want to be able to publish the blocks which are the largest contribu- 
tors. This criterion considers the relative size of the block in terms of the main vari- 
able of interest as follows: 



$$bl_2 = 100\, y / \max(y),$$

where y is the block’s total for the variable of interest and max(y) is the largest 
possible value of y across all the blocks which contain sensitive information. The $bl_2$ 
value is directly proportional to the relative importance of y. 

Suppressed Information. One of our goals is to reduce residual disclosure. Hence, 
we take into account the quantity of concealed confidential and residual information 
in a block where we observe sensitivity in some of its cells. We define: 

$$bl_3 = 100\, (y_1 + y_2) / \max(y_1 + y_2),$$

where, within the block, $y_1$ and $y_2$ are the total amounts for the variable of interest in 
sensitive and in complementary cells, respectively. The quantity $\max(y_1 + y_2)$ is the 
largest possible value of the sum of these two variables across all the blocks where 
information is hidden for confidentiality purposes. The $bl_3$ value will be larger if a 
large quantity of information is suppressed. 

Complementary Cells Suppression. We want to increase the number of cells 
published. This criterion identifies the blocks where the largest potential gains would be 
obtained when using waivers. We rank blocks according to the number of 
complementary cells suppressed using: 

$$bl_4 = 100\, x / \max(x),$$

where x is the number of complementary cells and max(x) is the largest possible value 
of x across all the blocks. The $bl_4$ value will be larger if many cells are suppressed. 

Using the four criteria for the blocks, we create a function to score and rank each 
block as follows: 



$f(bl) = w_1\,bl_1 + w_2\,bl_2 + w_3\,bl_3 + w_4\,bl_4$,

where the $bl_i$ variables represent each criterion’s value. The $w_i$ variables are the
weights applied according to the criterion’s deemed importance. After consultation
with subject matter specialists the four $w_i$ variables were all given the same value of
one-quarter. The f(bl) score for a block is expressed as a percentage.

By taking the two score functions for the cell [f(ce)] and the block [f(bl)], we can 
establish the relative priority of each cell by sorting each cell according to, first, its 
block's score and, second, its own score (cell level). 
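
The block scoring and the two-level sort described above can be sketched as follows. This is only an illustrative sketch: the `Block` fields, the function names and the tuple layout for cells are hypothetical, the cell-level score f(ce) is taken as given, and a criterion whose maximum is zero is assumed to contribute zero.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    waivers_needed: int         # x in criterion bl_1
    total_value: float          # y in criterion bl_2
    sensitive_value: float      # y_1 in criterion bl_3
    complementary_value: float  # y_2 in criterion bl_3
    complementary_cells: int    # x in criterion bl_4


def _pct(numerator, denominator):
    # Assumption: a criterion with a zero maximum contributes nothing.
    return 100.0 * numerator / denominator if denominator else 0.0


def block_scores(blocks, weights=(0.25, 0.25, 0.25, 0.25)):
    """Return {block name: f(bl)} using the four criteria and equal weights."""
    max_waivers = max(b.waivers_needed for b in blocks)
    max_total = max(b.total_value for b in blocks)
    max_hidden = max(b.sensitive_value + b.complementary_value for b in blocks)
    max_compl = max(b.complementary_cells for b in blocks)
    scores = {}
    for b in blocks:
        criteria = (
            _pct(max_waivers - b.waivers_needed, max_waivers),              # bl_1
            _pct(b.total_value, max_total),                                 # bl_2
            _pct(b.sensitive_value + b.complementary_value, max_hidden),    # bl_3
            _pct(b.complementary_cells, max_compl),                         # bl_4
        )
        scores[b.name] = sum(w * c for w, c in zip(weights, criteria))
    return scores


def rank_cells(cells, scores):
    """cells: (cell_id, block_name, cell_score); sort by block score, then cell score."""
    return sorted(cells, key=lambda c: (-scores[c[1]], -c[2]))


blocks = [Block("A", 1, 100.0, 10.0, 30.0, 4), Block("B", 4, 50.0, 20.0, 20.0, 2)]
scores = block_scores(blocks)
ranked = rank_cells([("c1", "B", 90.0), ("c2", "A", 10.0), ("c3", "A", 70.0)], scores)
```

With these made-up blocks, block A (few waivers, large total) outranks block B, so both of A's cells are listed before the higher-scoring cell of B.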



5 Practical Example for the Mixed Score Approach 

To assess the efficiency of the mixed score approach we applied the technique to the 
data collected for the food industry by the 1997 Annual Survey of Manufacturing 
(ASM). In 1997, roughly 3100 units in the food industry contributed $50.4 billion in
manufacturing shipments. Data were tabulated to cover a total of 319 cells for vari- 
ous industry and geographic levels (including totals and subtotals). Sensitive cells 
were identified using the agency’s Duffett Rules, which are sets of (n,k) rules whose 
parameters are confidential. Without using waivers, roughly 55% of the cells would 
be published. 

We compared two methods to prioritize the cells with potential waivers. One con- 
sisted of ranking candidate cells for waivers according to their total shipment values. 
The other method prioritized the cells through the use of the above score functions. 
Methods were compared in terms of number of cells publishable and percentage of 
total shipments. Results were examined in the cases where a maximum of 10 waivers 
are obtained out of a total of 65 identified potential waiver candidates for two sub- 
levels of the Standard Industrial Classification in the food sector: Poultry Products 
Industry (SIC-4) and Meat and Poultry Products Industry (SIC-3). 

The identification of waivers with the mixed score approach led to interesting gains 
at the most detailed level (SIC-4) which was the level that was more affected by sen- 
sitivity. An additional 15 cells could be published if waivers were obtained for the 
candidates identified using the score functions. These extra cells represented 5.5% of 
total manufacturing shipments and corresponded to close to one half of the manufac- 
turing shipments that was suppressed (12.4%). We also observed important gains in 
geographical areas where waivers had an impact. The additional cells that could be
released represented between 6% (in larger areas) and 34% (in smaller areas) of the
total shipment values. The mixed score approach also performed better than the
method of selecting waiver candidates according to their cell’s shipment values. At
the more aggregated industrial level, the SIC-3 level, the use of the score functions
did not lead to substantial gains.

Table 1. Number of Publishable Cells and Percentage of Shipments Publishable under 3 Scenarios.

                                Waivers not used    10 waivers selected      10 waivers selected
                                                    using total shipments    using mixed score
Industry level  Cells available Cells  Shipments    Cells  Shipments         Cells  Shipments
SIC-4           218             100    87.6%        105    90.2%             115    93.1%
SIC-3           88              61     97.1%        61     97.4%             62     97.1%



6 Application to the Wholesale Trade Survey 

The follow-up study originally aimed to expand the results of the mixed score ap- 
proach, and possibly to refine it. A different survey was chosen, the 2000 Wholesale 
Trade Survey (WTS), which covered about 8900 enterprises with sales of $2.8 billion
in the NAICS Wholesale Trade Sector (the North American Industry Classification
System replaced the 1980 Standard Industrial Classification System). The WTS
also used the Duffett Rules for sensitivity, but it did not use waivers. For purposes of 
this study the classification variables used were industry (78 NAICS-5 digit industries 
ordered in 17 Trade Groups), province/territory (13) and type of revenue (3 were 
used: sales, labour income and rental income). Type of revenue was used to allow 
three-dimensional studies. 




[Figure 1: diagram not recoverable; the one surviving axis label reads "Trade Group / Industry".]

Fig. 1. Structure of the Wholesale Trade Survey Data.

Figure 1 illustrates how the WTS data are structured. The first dimension, industry, 
is organised hierarchically, although some of the trade groups had few industries (4 
had only one industry, 2 had two industries, 4 had three industries). Blocks are two 
dimensional "slices", within trade groups, of industry, province/territory, or type of 
revenue. 
The follow-up study concentrated on measuring the total value suppressed by
CONFID. The characteristic "number of complementary cells suppressed" was 
dropped from the block score function. Two and three-dimensional variations were 
tried involving industry at both the NAICS 4 and 5-digit levels within trade groups. 
The three territories, accounting for around 1% of units and 0.1% of sales, were even- 
tually dropped because they did not affect results but contributed a lot of zero cells. 

Different numbers of waivers required were identified. As before, waivers were
"treated" by removing the sensitivity of the corresponding cell (and, if it matched
exactly one of its marginal cells, of that cell as well). For comparison purposes,
waivers were also selected for the same number of units using the approach of starting 
with the largest-valued sensitive cells, henceforth referred to as largest-first approach. 

The mixed score approach did not do as well as the largest-first approach.
In a two-dimensional analysis crossing NAICS-5 industry with province/territory,
22 waivers reduced the number of suppressed cells only from 367 to
362, while the total value of sales suppressed dropped from $19.37 million to $15.76
million. However, obtaining 22 waivers from the largest sensitive cells resulted in
359 suppressed cells with a total value of only $10.29 million. There were two main
reasons for the better performance of the largest-first approach. Firstly, it used an
exceptionally large sensitive cell that was not in the identified blocks because its
block required too many waivers. Secondly, not all the sensitive cells in identified
blocks were large enough or influential enough to merit treatment. The block score
approach “wasted” waivers on those cells, suggesting that the block and cell scores
need to be combined better. It was also noteworthy that this particular
two-dimensional problem had so many sensitive cells (295 out of 1122 nonzero cells) that
many served as complements for each other, reducing the need for complementary
suppressions among nonsensitive cells (only 72 were needed). This explains why the
waivers did not greatly reduce the number of suppressed cells: removing the
sensitivity of a cell did not cause much of a “domino effect” on complementary suppressions.

Two and three-dimensional results were examined with more aggregated industry 
(NAICS-4 industry within trade group) and excluding the territories. A three- 
dimensional problem at the NAICS-5 industry level was also examined. Results are 
shown in Table 2. For each situation and number of waivers we give the number of
cells suppressed and the total value for those suppressed cells. The segment cost ap- 
proach will be explained in the next section. 



Table 2. Number of cells suppressed, and value suppressed for different scenarios.

             Waivers not       Waivers selected    Selection using     Selection using
             used              for largest cells   mixed score         segment cost
             # cells  $ mil.   # cells  $ mil.     # cells  $ mil.     # cells  $ mil.

Two-Dimensional Analysis (NAICS-4 by Province: 528 cells, including 5 zero cells)
10 waivers   68       13.12    59       10.15      60       4.04       62       3.25
17 waivers                     57       5.10       50       3.93       60       2.64

Three-Dimensional Analysis (NAICS-4 by Prov. by Rev. Type: 2112 cells, incl. 253 zero)
10 waivers   768      281      758      273        765      279        765      262
19 waivers                     752      275        759      276        776      246
27 waivers                     746      265        746      265        767      228

Three-Dimensional Analysis (NAICS-5 by Prov. by Rev. Type: 4224 cells, incl. 814 zero)
20 waivers   1832     387      1815     400        1826     384        1837     345

The two-dimensional problem with 10 waivers gave better results with the mixed
score approach than with the largest-first approach (i.e., based on cell size). This dif- 
ference disappeared when 17 waivers were used, although there was a bigger gap in 
the number of suppressions. With NAICS-4 and three dimensions we notice that the 
methods used did not reduce the number of suppressed cells or their value by a large 
relative amount. The largest-first approach gave slightly better results than the mixed 
score approach. In the former, the amount suppressed increased slightly as we went 
from 10 waivers to 19. 

With the three dimensional NAICS-5 approach, there were so many sensitive cells 
(over 1000) that we stopped at 20 waivers, which was not nearly enough to cover 
sensitive cells in even the first block. With the mixed score approach the amount of 
suppressed information decreased slightly from $387 million to $384 million,
whereas with the largest-first approach it actually increased to $400 million.

This last result was not impossible. CONFID solves the complementary suppression
problem one cell at a time, starting with the most sensitive one. It is possible that
very large sensitive cells would have made good complements even if they were no 
longer sensitive, but they were not really "serious contenders" with the one-cell-at-a- 
time strategy. It is also worth mentioning that we used a two-pass approach with an 
information cost function in the second pass (see section 3). Results may be better 
from an “information cost” perspective but worse from a "value cost" perspective. 

A problem in using the block approach with a three-dimensional problem was that 
we used “slices” for blocks - not getting the full picture. With three dimensions, 
complementary suppression cells inside a block were more likely to have arisen from 
the need to protect sensitive cells outside of it. Although the use of cubes correspond- 
ing to the 17 trade groups would have been more appropriate, these would have re- 
quired too many waivers. Not only would it take considerable effort to obtain those, 
but we would be more likely to encounter refusals to grant waivers - making the out- 
come of the exercise unpredictable. 



7 The Segment Cost Approach 

The segment cost approach was developed from the need to have a method that better 
combined a cell-level score and a block-level score, and that addressed the three di- 
mensional problem more satisfactorily. Its development was partly influenced by two 
other observations. One was the fact that an exceptionally large sensitive cell was 
“missed” in the mixed score approach. The other was the observation that, because of 
the sometimes “flat” hierarchical structure of the data (trade groups with few indus- 
tries, only three levels for type of revenue) some small sensitive cells only had huge 
cells available as potential complements. 

Segments are one-dimensional sets of cells adding up to a marginal total (e.g., the
industries comprising a trade group, the ten provinces or the three revenue
components) when the other dimensions’ values are fixed. In each segment the cell i with the
largest sensitivity value, $S_i$, is a suitable complement for all other sensitive cells j in
that segment (since $T_i \ge S_i \ge S_j$, where $T_i$ is the value of cell i). The cell i is also
protected, perhaps partially, by the other sensitive cells j so that the residual sensitivity R
requiring protection is $R_i = \max\{0,\ S_i - \sum_j T_j\}$ (for j sensitive).
The "minimum cost" of protecting the cell i along that segment is defined to be the
least of (a) the smallest non-sensitive cell k with value $T_k \ge R$; (b) the smallest sum
of non-sensitive cells smaller than R that is greater than or equal to R (i.e.,
$\min\{\sum_k T_k \mid k \text{ non-sensitive},\ T_k < R,\ \sum_k T_k \ge R\}$), up to a maximum
number of cells that we set at 6;
and (c) the marginal sub-total for that segment. The first two values may not exist,
and the third may be a sensitive cell. To each interior cell i, then, is associated a cost
equal to its value, $T_i$, plus the sum of the minimum cost along each of its axes. The
expression "minimum cost" was put in quotes because it is minimum under certain
conditions. For each segment, only one sensitive cell is attributed the corresponding
minimum cost. The other cells along that segment are not.
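
As an illustration only, the per-segment "minimum cost" could be computed along the following lines. The function names and the input layout are hypothetical. The sketch assumes that the residual sensitivity is the largest sensitivity minus the values of the other sensitive cells in the segment, that option (a) uses "greater than or equal to R", and that a zero residual means no cost along that segment; none of these details is confirmed beyond what the text states.

```python
from itertools import combinations

MAX_CELLS = 6  # cap on the number of cells combined in option (b), as set in the text


def segment_min_cost(values, sensitivities, marginal_total):
    """values[k] = T_k; sensitivities[k] = S_k (0 for non-sensitive cells).

    Returns (index of the most sensitive cell, its "minimum cost" along
    the segment), or (None, 0.0) when the segment has no sensitive cell.
    """
    sensitive = [k for k, s in enumerate(sensitivities) if s > 0]
    if not sensitive:
        return None, 0.0
    i = max(sensitive, key=lambda k: sensitivities[k])
    # Assumed residual: the other sensitive cells already protect cell i.
    others = sum(values[j] for j in sensitive if j != i)
    residual = max(0.0, sensitivities[i] - others)
    if residual == 0:
        return i, 0.0  # assumption: nothing left to protect on this axis

    free = [values[k] for k, s in enumerate(sensitivities) if s == 0]
    candidates = [marginal_total]                       # option (c)
    large = [t for t in free if t >= residual]
    if large:
        candidates.append(min(large))                   # option (a)
    small = [t for t in free if t < residual]
    for size in range(2, MAX_CELLS + 1):                # option (b), brute force
        sums = [sum(c) for c in combinations(small, size) if sum(c) >= residual]
        if sums:
            candidates.append(min(sums))
    return i, min(candidates)


def cell_cost(value, axis_costs):
    """Cost of an interior cell: its own value plus the minimum cost on each axis."""
    return value + sum(axis_costs)


case_a = segment_min_cost([100, 8, 5, 4, 30], [20, 3, 0, 0, 0], 147)
case_b = segment_min_cost([100, 8, 7, 6, 30], [20, 3, 0, 0, 0], 151)
case_c = segment_min_cost([10, 50], [5, 2], 60)
```

In the first made-up segment the residual is 12 and only the cell of value 30 can cover it; in the second, two small cells (7 and 6) cover it more cheaply. The brute-force search is acceptable here only because segments are short.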

This new cost function is relatively straightforward to calculate, does not consider 
complementary suppression cells (and thus is not concerned with whether they origi- 
nated from the sensitive cell or not), can incorporate a number of dimensions, takes 
into account the size of the cell and additional information about surrounding cells. Its 
shortcomings are that it ignores the impact on suppressions outside the axes of the 
sensitive cell and that the “winner take all” method of allocating the minimum cost in 
a segment to its most sensitive cell does not consider what happens to the other sensi- 
tive cells after the most sensitive cell is published (waivers are obtained). Finally, 
although we have not covered the issue in this paper, it does not cover the case where 
enterprises are present in multiple cells. 

In spite of its shortcomings the method performed well for the WTS. As seen in the 
last two columns of Table 2, the approach consistently gave better results in terms of
value suppressed than the other two methods - sometimes much better. Because the 
approach was oriented towards minimizing the value suppressed it did not perform 
well when considering the number of suppressed cells. 



8 Conclusion 

The findings above represent work in progress. They have been tried for a limited set 
of problems and gave different results for differently structured populations. The first 
approach incorporated several factors into its mixed score and targeted blocks for 
treatment - which was useful when the number of blocks was large and the number of 
waivers required within them was small. This approach can help subject matter spe- 
cialists in deciding where to focus their attention. One problem was in its application 
to three-dimensional problems, although even then it may give good results with sur- 
veys having a large number of "cubes" small enough to be used as blocks. In empiri- 
cal investigations with the Wholesale Trade Survey it was shown that it was not al- 
ways beneficial to exhaust all the waivers in a block before going to the next. This is 
probably the aspect of this approach that requires the most attention. 

The segment cost approach seemed to represent a good compromise between the 
need to consider a cell on its own and in relation to others. It was aimed at reducing 
the value of suppressed cells, which it did rather well, but as a result it did not per- 
form as well in reducing the number of suppressions. One particular type of problem 
it targeted (small cells with large complements) may suggest alternative approaches to 
waivers such as rounding [4] or controlled tabular adjustment [1]. This method could 
probably benefit from weights that gave different importance to the costs along the 
different dimensions in the segments (e.g., industry, province, revenue type). Subject 
matter guidance would be needed. The method should also aim to better reduce the number of
cells suppressed.

One aspect that was not addressed directly in this study was the use of marginal 
cells for waivers. Marginal cells are sensitive because of enterprises that are also re- 
sponsible for the sensitivity of internal cells, so the issue was partly addressed. How- 
ever, the "score" assigned to a sensitive cell should also be distributed to its largest 
enterprises so that enterprise level scores are possible. In surveys where enterprises 
can contribute to many cells, this may point, for example, to enterprises responsible 
for a number of sensitive cells. 
Abstract. Statistical agencies typically serve a diverse group of end
users with varying information needs. Accommodating the conflicting 
needs for information in combination with stringent rules for statistical 
disclosure limitation (SDL) of statistical information creates a special 
challenge. We provide a generic table server design for SDL of tabular 
data to meet this challenge. Our table server design works equally well 
with counts data and magnitude data, and is compatible with commonly 
used cell perturbation methods and cell suppression methods used for the 
statistical disclosure control of sensitive tabular data. We demonstrate 
the scope and the effectiveness of our table server design on counts and 
magnitude data by using a simplified controlled tabular adjustment pro- 
cedure proposed by Dandekar (2003). In addition to ad hoc queries, the 
information compiled using our table server design could be used to cap- 
ture multi-way interactions of counts data and magnitude data either in 
a static environment or in dynamic mode. 



1 Introduction 

Procedures to protect sensitive cells in tabular data have evolved over the last few 
decades. In recent years there has been an increased realization that “one table 
at a time” tabular data protection procedures, though easy to implement, could 
be error prone and pose serious disclosure risks. As a result, system designers 
are increasingly tasked to develop a unified strategy to simultaneously protect 
multiple different tables resulting from the same survey data collection form. 
The complexity of the data collection form and resulting tabular data structures 
in related publications varies from survey to survey. This makes it difficult to 
develop a unified design strategy that will accommodate multiple variations in 
table structures as well as inter-table relations. 

In the first part of the paper, we propose a unified method to simultaneously 
capture and process multiple different tables containing complex hierarchies and 
linked structures resulting from the same data collection efforts. As an integral 
part of our proposed method, we introduce the concept of a “guidance matrix” 
to capture complex interaction effects present within and across table structures. 
We first explain the role of the guidance matrix concept in capturing complexi-
ties of seven different test cases of magnitude data. These test cases are currently 
available in the public domain to all researchers of tabular data protection from 
the website: http://webpages.ull.es/users/casc/ficheros.csp.htm. We also demon- 
strate the effectiveness of the guidance matrix concept in capturing multi-way 
interaction terms by illustrating its use on real life eight-dimensional counts data. 

In the second part of the paper, we demonstrate the unique characteristics 
associated with the simplified Controlled Tabular Adjustment (CTA) procedure of
Dandekar (2003). This technique, when applied to any generic N-dimensional
counts data structure in its entirety, offers complete protection from statistical
disclosure. We utilize the guidance matrix concept, developed in the first part 
of the paper, to demonstrate the statistical validity of the CTA-adjusted counts 
table structure by performing cross section by cross section analysis of the SDL 
protected eight-dimensional counts data. We conclude by providing basic details 
on how to design a generic web-based table server to function in a dynamic 
mode by using the simplified CTA procedure. Our proposed web-based table 
server design, after some fine tuning, is capable of serving diverse user groups 
with conflicting information needs. 



2 Concept of Guidance Matrix 

A typical survey data collection form consists of multiple fields containing vari- 
ous categorical variables and related quantitative information. Categorical vari- 
ables are mainly used to define table structures for the publication of cumula- 
tive statistics related to the quantitative information. Often, various intervals 
of quantitative information fields are used to create additional categories in the 
table structure. 

For our discussion, we assume that the N dimensional space defined by the 
N different categorical variables has already been identified a priori by the sta- 
tistical office. In our proposed implementation method, we first construct an N 
columns by M rows boolean guidance matrix as follows: 

— J= 1 to M rows of the guidance matrix define M different tables to be created 
by the publication system. 

— Within the Jth row of the guidance matrix, the column position assumes the 
value of zero if the related categorical variable details are not included in 
the Jth table. The column value of one indicates that the related categorical 
variable details are available in the Jth table. 

We use the contents of the guidance matrix as a tool to connect, on a con- 
sistent basis, all different cross sections the statistical agency intends to display 
to the general public either through a conventional table publication system or 
through web access. The contents of the guidance matrix serve as a filter to 
extract only the relevant aggregates for publication purposes from a generic N 
dimensional table structure. Tables generated by using the guidance matrix con- 
cept could be protected by either cell perturbation or cell suppression methods. 
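
The filtering role of the guidance matrix can be illustrated with a short sketch. It assumes, hypothetically, that the microdata are held as a list of dictionaries keyed by variable name, with an optional "value" field for magnitude data (a count of one is used when the field is absent); the function name and record layout are not part of the original design.

```python
from collections import Counter


def build_tables(records, variables, guidance_matrix):
    """Aggregate records into one table per row of the guidance matrix.

    variables: column order of the guidance matrix (N names).
    guidance_matrix: M rows of N zeros/ones; a one keeps that variable's detail.
    Returns a list of (kept_variables, {cell_key: aggregate}) pairs.
    """
    tables = []
    for row in guidance_matrix:
        kept = [v for v, flag in zip(variables, row) if flag == 1]
        table = Counter()
        for rec in records:
            key = tuple(rec[v] for v in kept)
            table[key] += rec.get("value", 1)  # count data when no magnitude given
        tables.append((kept, dict(table)))
    return tables


variables = ["state", "sic2", "sex"]
records = [
    {"state": "NY", "sic2": "20", "sex": "M"},
    {"state": "NY", "sic2": "20", "sex": "F"},
    {"state": "CA", "sic2": "21", "sex": "M"},
]
guidance_matrix = [[1, 1, 0],   # table 1: state by sic2
                   [0, 1, 1]]   # table 2: sic2 by sex
tables = build_tables(records, variables, guidance_matrix)
```

Each row yields only the interior cells of its table; marginal totals, and any protection by perturbation or suppression, would be applied in a later step.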
3 Guidance Matrices for Seven Test Cases

Currently, seven complex test cases of magnitude data are available to all SDL 
researchers of tabular data. Three of the seven test cases (HIER13, HIER16, and 
BTS4) consist of single tables with complex hierarchical structures. For these 
three test cases, the guidance matrix consists of only a single row. Due to the 
fact that all the column details are included in the table structure, all column 
values are assigned the value of one in the guidance matrix. The remaining 
four test cases each consist of multiple different multi-dimensional tables with 
complex linked structures. By following the guideline above, the guidance matrix 
for these test cases, namely NINENEW, NINE5D, TWO5IN6 and NINE12, can
be displayed by using a consistent procedure. The guidance matrices for the 
seven test cases are in Table 1. 



Table 1. Guidance Matrices for Seven Test Cases 
Two of the three 9-D test cases, namely NINENEW and NINE12, are created
from the same micro data file. In these two test cases, only a limited number
of distinct 4-D linked tables are captured in 9-D variable space. In a real data
situation, the statistical office would have decided on the substructure, including
the dimensionality, for each table by taking into account user needs for published
information.

In an extreme situation, the statistical office could have decided on capturing 
all the possible 4-D linked tables resulting from the 9 variables and making them 
available to the general public after statistical disclosure control of the tabular 
data. Under such a scenario, the guidance matrix will consist of 126 rows, derived
by choosing 4 of the nine variables at a time (the number of combinations of nine
items taken 4 at a time). Similarly, if there is a need to publish all possible 5-D or
6-D tables resulting from the nine variables for public use, guidance matrices
consisting of 126 and 84 rows, respectively, could be constructed to satisfy user
needs.
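
The row counts quoted above follow from elementary combinatorics, and the corresponding guidance matrices are easy to enumerate. The sketch below is illustrative; the function name is hypothetical.

```python
from itertools import combinations
from math import comb


def all_k_way_guidance_matrix(n_vars, k):
    """One row per way of choosing k of the n_vars variables (1 = kept)."""
    rows = []
    for chosen in combinations(range(n_vars), k):
        rows.append([1 if col in chosen else 0 for col in range(n_vars)])
    return rows


# Row counts for all 4-D, 5-D and 6-D tables over nine variables.
row_counts = {k: len(all_k_way_guidance_matrix(9, k)) for k in (4, 5, 6)}
```

The counts agree with the binomial coefficients C(9,4) = 126, C(9,5) = 126 and C(9,6) = 84.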

4 Complex Interactions in Hierarchical Structures 

The guidance matrix concept offers considerable flexibility in capturing multi- 
way complex interactions in table publication systems consisting of multiple 
hierarchical structures. This is easy to demonstrate by using a hypothetical ex- 
ample. We assume that the statistical agency collects manufacturing industry 
data at the state level and three different Standard Industry Classification (SIC) 
levels, namely 2 digits, 3 digits and 4 digits SIC codes. We further assume that 
the individual states are further grouped into various regions. Complex require- 
ments, such as the statistical agency deciding to publish (a) 2 digits SIC level 
data at the state level (b) 2 digits and 3 digits SIC level data at the regional 
level and (c) 2 digits, 3 digits, and 4 digits SIC level data at the national level, 
are easy to implement by using the guidance matrix concept. In the following 
guidance matrix implementation we assume that the first five columns of the 
guidance matrix represent state code, region code, 2 digits SIC code, 3 digits 
SIC code and 4 digits SIC code respectively. The rest of the column positions 
in the guidance matrix are assumed to relate to other categorical variables. To 
capture the complex data publication requirement above, the guidance matrix 
is implemented as shown in Table 2: 

Table 2. Complex Hierarchies Using Guidance Matrix 



10100 2 digits SIC level state data 

01110 2 and 3 digits SIC level regional data

00111 2, 3, and 4 digits SIC level national data 



5 Eight Dimensional Counts Data Using CPS File from UC Irvine

Multi-way interaction terms are of importance in an analysis of counts and mag- 
nitude data. In this section, we demonstrate the versatility of the use of a guid- 
ance matrix to capture multi-way interactions on eight dimensional counts data. 
The file contains information on 48,842 individuals. The file was obtained from
Dr. Alan Karr of the National Institute of Statistical Sciences (NISS) in August
2001.¹ Dr. Karr believes this to be real data created from a public use file
whose original source is the Current Population Survey (CPS) file from UC Irvine.
According to Dr. Karr, there were originally more variables and more categories
for some of the variables, and NISS staff did some collapsing of the variables.
After collapsing, the test data has the characteristics shown in Table 3.

¹ In the SDL literature, the 8-D counts data has been used by NISS to justify partial
release of a select few summary tables as a statistical disclosure control strategy.



Table 3. Characteristics of Eight Dimensional Counts Data

Variable              Category Label                # of Categories
Age                   <25, 25-55, >55               3
Employer Type         Govt, Priv, SE, Other         4
Education Completed   <HS, HS, Bach, Bach+, Coll    5
Marital Status        Married, Unmarried            2
Race                  White, Non White              2
Sex                   Male, Female                  2
Hours Worked          <40, 40, >40                  3
Annual Salary         <$50K, >=$50K                 2


There are multiple ways this data could be processed for statistical disclo- 
sure control and released for general public use. As an option, based on the 
data users’ needs the statistical office could decide a priori on the total num- 
ber of cross sections and dimensionalities of each cross section for public release 
by using the concept of a guidance matrix; and thereafter simultaneously ap- 
ply an appropriate SDL method to those cross sections prior to the release of 
the information for general public use. Both cell suppression methods and cell 
perturbation methods could be used for statistical disclosure control of related 
tables. 

Alternatively, the statistical office could decide on public release of table 
structures by using all the possible combinations of eight variables. Table 4 
shows eight such options available to the statistical agency. In the first option, 
the statistical office could decide to release one cross section containing the entire 
8-D structure. Such a structure will contain 33,860 non-zero cells, of which 3,874 
cells could be considered as sensitive cells because those cells contain 2 or fewer 
respondents. To protect such a table from statistical disclosure by using linear 
programming based techniques, a total of 113,538 equality constraints will need 
to be evaluated for a feasible solution. Similarly, the statistical agency could 
decide on publishing eight 7-D structures or twenty-eight 6-D structures and so 
on. The total number of cross sections for release for each option is determined 
by using combinatorial logic. 
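
As a rough sketch of this bookkeeping, the number of cross sections per option and the number of sensitive cells per cross section could be tallied as shown below. The record layout and function name are hypothetical, the threshold of 2 respondents follows the text, and only interior cells are counted here, so the figures would not reproduce totals that also include marginal cells.

```python
from collections import Counter
from itertools import combinations
from math import comb


def sensitive_cell_summary(records, variables, k, threshold=2):
    """For every k-variable cross section, return (non-zero cells, sensitive cells).

    A cell is treated as sensitive when it holds `threshold` or fewer respondents.
    Only interior cells are counted; marginal totals are ignored in this sketch.
    """
    summary = {}
    for kept in combinations(variables, k):
        counts = Counter(tuple(rec[v] for v in kept) for rec in records)
        sensitive = sum(1 for c in counts.values() if c <= threshold)
        summary[kept] = (len(counts), sensitive)
    return summary


# Number of cross sections for the 8-D, 7-D, ..., 1-D options.
sections_per_option = [comb(8, k) for k in range(8, 0, -1)]

# Tiny made-up data set with two variables.
toy = [{"a": 1, "b": "x"}, {"a": 1, "b": "x"}, {"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
one_way = sensitive_cell_summary(toy, ["a", "b"], 1)
two_way = sensitive_cell_summary(toy, ["a", "b"], 2)
```

The list `sections_per_option` gives 1, 8, 28, 56, 70, 56, 28 and 8 cross sections for the eight options.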

From Table 4, it is obvious that in the option to release all possible 3-D 
sections, there are no sensitive cells involved. As a result, the statistical office 
could safely release those 56 cross sections to the general public without any 
concern for statistical disclosure. The last two options in the table, namely the 28
2-D sections and the eight 1-D sections, are subsets of the fifty-six 3-D sections.

It is possible that some of the 70 4-D sections might not contain any sensi- 
tive cells. The 4-D sections without any sensitive cells could also be released to 
the general public without any concern for statistical disclosure. The drawback 
of such a data release strategy is that only a small fraction of the total data 
collection effort is published. In addition, cross sections released using such a 
strategy might not be what the majority of data users need. Considering the
cost involved in conducting surveys, the statistical office needs to concentrate on
developing a strategy that will publish as much of the information as possible
after adequately protecting data from statistical disclosure. We tackle that issue
in later sections.

Table 4. Characteristics of Multidimensional Cross Sections of 8-D Counts Data

Dimension of the   Total Number of    Total Number of    Total Number of    Total Number of
Cross Section      Cross Sections     Non-Zero Cells     Sensitive Cells    Equality Constraints
8                  1                  33,860             3,874              113,538
7                  8                  32,165             3,327              105,045
6                  28                 25,367             1,832              76,731
5                  56                 14,609             498                39,411
4                  70                 5,755              56                 13,561
3                  56                 1,506              0                  n/a
2                  28                 251                0                  n/a
1                  8                  24                 0                  n/a

Table 5. Characteristics of Eight 7-D Cross Sections from 8-D Counts Data

Section #   # of Non-Zero Cells   # of Sensitive Cells   # of Equality Constraints
1           12,200                330                    35,777
2           8,961                 685                    27,229
3           11,748                1,063                  34,911
4           11,921                915                    35,232
5           11,829                970                    35,076
6           6,179                 305                    19,119
7           7,446                 437                    22,755
8           9,394                 508                    28,139

The classical literature related to SDL of tabular data often relies on a se- 
quential processing of related cross sections followed by a backtracking procedure 
to ensure consistent protection across all cross sections. Apart from the ease of 
software implementation, such a strategy reduces the technical problem to a 
manageable size. Such a practice was desired and was considered acceptable in 
the past due to the practical resource constraints dictated by limited compu- 
tational power. We believe that such an approach, in some instances, could be 
error prone and is not justified in an era where computational resources are of 
minor concern. However, as a matter of general research interest, in Table 5 we
summarize the characteristics of one such option, namely generating and processing
eight 7-D structures sequentially (option 2 from Table 4). The overall
summary statistics of these eight sections are shown using a format similar to 
the one used in Table 4. 

The contents of Table 5 are based on the eight 7-D cross sections derived
by excluding one variable at a time from the 8-D data, starting with section
number 1, which excludes the 8th variable, followed by section number 2, which
excludes the 7th variable, and so on. The guidance matrix associated with these
eight sections is in Table 6.



Table 6. Guidance Matrix for Eight 7-D Cross Sections 

11111110 

11111101 

11111011 

11110111 

11101111 

11011111 

10111111 

01111111 
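
The rows of Table 6 can be generated mechanically. The following is a minimal sketch, not the author's software: the function name `guidance_matrix` and the row ordering convention are assumptions made for illustration. Each row holds a 1 for every variable kept in a cross section and a 0 for every variable excluded.

```python
from itertools import combinations

def guidance_matrix(n_vars, k):
    """Return one 0/1 row per k-D cross section of n_vars-D data.

    A 1 in position j means variable j+1 is kept in that cross section,
    a 0 means it is excluded (summed over).
    """
    rows = []
    for dropped in combinations(range(n_vars), n_vars - k):
        rows.append([0 if j in dropped else 1 for j in range(n_vars)])
    # Assumed ordering: the section that drops the last variable comes first,
    # which reproduces the order of Table 6 for k = 7.
    rows.sort(reverse=True)
    return rows

if __name__ == "__main__":
    for row in guidance_matrix(8, 7):
        print("".join(map(str, row)))
```

The number of rows produced for k = 8, 7, ..., 1 is 1, 8, 28, 56, 70, 56, 28, 8, which agrees with the "Total Number of Cross Sections" column of Table 4.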



As can be seen collectively from Tables 4, 5, and 6, when the eight 7-D
sections are processed as a single table, there are 32,165 non-zero cells in
the table and over 105K equality constraints to work with. However, when these
eight sections are identified and processed separately, the non-zero cell count
and the sensitive cell count vary widely from section to section. The number of
equality constraints in each section is significantly smaller, which reduces
the computational complexity and computation time by an order of magnitude
for LP-based data protection methods. The cell counts of the separately
processed sections do not add up to 32,165. This is due to complex 7-way
interaction effects captured through the guidance matrix.



6 Statistical Disclosure Control by Controlled Tabular Adjustments

To minimize the information loss, as an option, the statistical office could use
controlled tabular adjustments (CTA) on counts data and magnitude data^.
Dandekar/Cox (2002) have proposed an LP-based CTA implementation
procedure. Dandekar (2003) has proposed a simplified CTA procedure,
which attempts to minimize the percentage deviation from the true cell value for
non-sensitive cells without using any special-purpose software. In Table 7, we
summarize the outcome of both CTA procedures when used on the NINE12
magnitude test data. The LP-based procedure uses the reciprocal of cell values
as a cost function to minimize the overall deviation of non-sensitive cells from
the true cell value. This reciprocal-based cost function closely captures the
minimum percent deviation criterion used in the simplified CTA procedure.

^ Some iterative refinement of the CTA protected counts data might be required to
eliminate fractional adjustments.
Table 7. Comparison of LP-Based CTA and Simplified CTA Method

[Table body not recoverable. Two panels, "LP-Based CTA, 12 x 4-D Cross Sections"
and "Simplified CTA, 12 x 4-D Cross Sections", each a histogram of cell size
(rows) by absolute change in cell value (columns).]

In Table 7, we use a two-dimensional histogram of the non-zero cell count to
show how the cell value changes are distributed by cell size. The rows of
the table classify cells into cell size categories. The columns classify the
change in cell value after the CTA procedure is applied to the NINE12 test
case. The comparative analysis of the outcome of the two CTA methods reveals
that the simplified CTA procedure makes smaller changes overall to the
magnitude data than the LP-based procedure proposed by Dandekar/Cox (2002).
Approximately 62% of cells (6,275 out of 10,399 cells) in the simplified CTA
method are unchanged, but only 20% of cells (2,051 out of 10,399 cells) in the
LP-based CTA procedure are unchanged. Based on the percentage change in cell
value criterion, 73% of cells in the LP-based CTA procedure and 79% of cells in
the simplified CTA procedure are within 1% of the true cell value. These
statistics indicate that the simplified CTA procedure offers a low cost,
computationally efficient, practical alternative to the LP-based CTA procedure.
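
A two-dimensional histogram of this kind can be tabulated as in the sketch below. It is illustrative only: the function name, the bin edges and the sample values are assumptions, not the categories or data behind Table 7.

```python
from bisect import bisect_left

def two_way_histogram(true_values, adjusted_values, size_edges, change_edges):
    """Count cells by (cell size bin, absolute change bin).

    Bin i holds values v with edges[i-1] < v <= edges[i]; the last bin is
    open-ended. Both edge lists must be sorted in increasing order.
    """
    hist = [[0] * (len(change_edges) + 1) for _ in range(len(size_edges) + 1)]
    for true, adjusted in zip(true_values, adjusted_values):
        row = bisect_left(size_edges, true)
        col = bisect_left(change_edges, abs(adjusted - true))
        hist[row][col] += 1
    return hist

if __name__ == "__main__":
    # Hypothetical cell values before and after adjustment.
    true_cells = [2, 40, 40, 500, 1200]
    cta_cells = [3, 40, 42, 500, 1190]
    hist = two_way_histogram(true_cells, cta_cells,
                             size_edges=[10, 100, 1000],
                             change_edges=[0, 1, 5])
    unchanged = sum(row[0] for row in hist)
    print(hist, unchanged / len(true_cells))
```

With a change edge at 0, the first column counts the unchanged cells, which is the quantity behind statements such as "6,275 out of 10,399 cells are unchanged".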

As mentioned earlier, in the NINE12 test case only 12 out of the possible
126 4-D cross sections were included to generate 4-D linked tables. In Appendix
A, we provide similar two-dimensional histogram summary statistics for
all possible 4-D, 5-D and 6-D sections protected by the CTA procedure. The
databases created for these three test cases contain (a) 126 4-D sections with
60,588 non-zero cells, (b) 126 5-D sections with 120,161 non-zero cells, and
(c) 84 6-D sections with 161,783 non-zero cells. In all three test runs, 68%
of total cell values are consistently within 1% of the true cell value.

7 Simplified CTA Procedure for Counts Data 

We have used the simplified CTA procedure on the entire 8-D cross section of
the counts data^. In our CTA implementation, cells with values of less than
three are selected for adjustment at random. The values of these cells are
changed either to zero or to three in such a way that the net adjustment to the
table total always stays close to zero after each adjustment. The CTA procedure
changes the table total from 48,842 to 48,844. Table 8 summarizes the outcome
of the simplified CTA procedure using a two-dimensional histogram constructed
in a format similar to Table 7. Approximately 10% of the 8-D cross section
cells (3,874 out of 33,860) are sensitive. Yet approximately one third of
the total cells (10,986 out of 33,860) are unchanged. Another third of the cells
(10,266 out of 33,860) are changed by only a single unit. The remaining
third of the cells undergo varying amounts of change^. Overall, only 18 percent
of the cells are altered by more than two counts.
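
The adjustment rule just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name, the dictionary layout and the greedy choice of direction (pick whichever of "down to zero" or "up to three" keeps the running net adjustment closest to zero) are hypothetical, and the sketch only touches the cells it is given; recomputing marginal cells from the adjusted internal cells is left out.

```python
import random

def adjust_small_counts(cells, threshold=3, seed=0):
    """Move every count in 1..threshold-1 to either 0 or threshold.

    cells maps a cell key to its count. Sensitive cells are visited in random
    order, and each one is moved in the direction that keeps the running net
    adjustment to the table total closest to zero.
    """
    rng = random.Random(seed)
    adjusted = dict(cells)
    sensitive = [key for key, value in cells.items() if 0 < value < threshold]
    rng.shuffle(sensitive)
    net = 0
    for key in sensitive:
        value = cells[key]
        down, up = -value, threshold - value
        if abs(net + down) < abs(net + up):
            delta = down
        elif abs(net + up) < abs(net + down):
            delta = up
        else:
            delta = rng.choice((down, up))
        adjusted[key] = value + delta
        net += delta
    return adjusted, net

if __name__ == "__main__":
    counts = {("a", 1): 1, ("a", 2): 2, ("b", 1): 7, ("b", 2): 1, ("c", 1): 2}
    protected, net_change = adjust_small_counts(counts)
    print(protected, net_change)
```

With a threshold of three, the two candidate moves for a cell always differ by three units, so the running net adjustment in this sketch never exceeds one unit in absolute value.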

By using the strategy proposed by Dandekar/Cox (2002) to convey the over- 
all cell value accuracy to the general public, the CTA protected 8-D section of 
the counts data could be released for public use in its entirety. As a part of 
the recommended strategy, the statistical agency could make the Table 8 format 
summary statistics publicly available to all data users. In Table 8, some collaps- 
ing of small cell categories might be required to avoid an indirect disclosure of 
the data protection strategy. 

^ The CTA protected data and related log-linear model analysis are available by
contacting the author. We look forward to comparative evaluation of CTA protected
data with other sensitive tabular data protection methods.

^ The preliminary analysis reveals that, for a large population, relatively small
changes in the actual cell counts do not hinder statistical inference based on
log-linear models. Uncertainty bounds of log-linear model based statistical
inferences resulting from the small changes in actual cell counts are easy to
establish by using the Latin Hypercube Sampling technique.
Table 8. Histogram By Cell Size and Change in Cell Value

8-D Cross Section - Cell Count: 33,860
[Table body not recoverable: a histogram of cell size (rows) by absolute change
in cell value (columns).]

To convey cell level data quality, the statistical agency could use a quality
indicator field in connection with cell values to reflect “poor” or “good” data
quality. The other option could be simply not to publish the cell values that
fall outside previously accepted threshold accuracy criteria. In the latter
option, a user who needs a “ball park” estimate remains free to estimate the
missing cell value from the related equality constraints.

Statistical agencies typically serve a diverse group of end users with varying 
information needs. Once CTA-adjusted data is in the public domain, different 
users are interested in different cross sections of these data. For example, for 
individuals interested in developing log-linear models on these data, the first few 
orders of interaction terms captured by all combinations of 3-D cross sections is 
all that might be required. As mentioned earlier, there are no sensitive cells in 
any of the 3-D sections of the database. However, the challenge for the statistical 
agency is to balance the need for the data for log-linear model developers with 
potential other users with different information needs. 

In Table 9 we explain how CTA attempts to balance data utility and information
loss. Table 9 is in three parts. Part A of the table shows the cell size
distribution for different cross section dimensions^. As illustrated in the
table, cells in the lower dimensional cross sections are typically larger. In
our example, the upper limit for the cell size is 48,842, which is the total
number of individuals in the database. At the bottom of the first part of the
table, we provide a cumulative cell count for various cross sections. If
log-linear modelers were the only users

^ In multivariate data analysis, the cell values from a 1-D cross section are
used to develop a main effects model. The cell values from a 2-D cross section
are used to capture two-way interaction terms, and so on for the higher
dimensional cross sections.
Table 9. Cross Sectional Details of Simplified CTA Protected Counts Data

Part (A) CELL COUNT BY CELL SIZE

    CELL SIZE       < CROSS SECTION DIMENSION >
  From  -    To      1-D     2-D     3-D     4-D     5-D     6-D     7-D     8-D

     1  -     1        0       0       0      27     282    1073    2026    2387
     2  -     2        0       0       0      29     216     759    1301    1487
     3  -     5        0       0       3      60     542    1646    2647    2955
     6  -    10        0       0       2     113     673    1793    2702    2947
    11  -    25        0       0      18     261    1353    3189    4400    4655
    26  -    50        0       1      25     319    1409    2872    3632    3766
    51  -   100        0       1      50     523    1792    3124    3721    3825
   101  -   500        0      10     319    1826    4413    6337    6994    7082
   501  -  1000        0      16     254     860    1485    1852    1973    1986
  1001  -  5000        2     113     574    1356    2027    2303    2350    2351
  5001  - 10000        6      50     157     260     296     296     298     298
 10001  - 25000        9      48      92     109     109     109     109     109
 25001  - 50000        7      12      12      12      12      12      12      12

 Total # of Cells     24     251    1506    5755   14609   25367   32165   33860

Part (B) AVERAGE CELL VALUE CHANGE BY CELL SIZE

  From  -    To      1-D     2-D     3-D     4-D     5-D     6-D     7-D     8-D

     1  -     1      .00     .00     .00    1.30    1.30    1.32    1.33    1.33
     2  -     2      .00     .00     .00    1.48    1.65    1.64    1.59    1.56
     3  -     5      .00     .00     .67    1.68    1.64    1.43    1.18    1.06
     6  -    10      .00     .00     .50    2.19    1.85    1.50    1.25    1.15
    11  -    25      .00     .00    2.28    2.20    1.93    1.57    1.31    1.24
    26  -    50      .00     .00    2.68    2.57    2.07    1.63    1.36    1.31
    51  -   100      .00    4.00    3.68    2.90    2.26    1.76    1.51    1.47
   101  -   500      .00    4.80    3.62    2.99    2.30    1.84    1.68    1.66
   501  -  1000      .00    2.56    3.91    3.17    2.49    2.09    1.97    1.95
  1001  -  5000     2.50    5.19    4.41    3.42    2.71    2.42    2.37    2.37
  5001  - 10000     1.33    4.42    4.03    3.52    3.19    3.17    3.17    3.17
 10001  - 25000     5.44    4.75    4.78    4.26    4.26    4.26    4.26    4.26
 25001  - 50000     3.14    5.06    5.08    5.08    5.08    5.08    5.08    5.08

Part (C) AVERAGE PERCENT CELL VALUE CHANGE BY CELL SIZE

  From  -    To      1-D     2-D     3-D     4-D     5-D     6-D     7-D     8-D

     1  -     1      .00     .00     .00  129.63  130.50  132.06  133.27  133.43
     2  -     2      .00     .00     .00   74.14   82.41   82.21   79.40   77.91
     3  -     5      .00     .00   16.67   42.08   41.10   35.66   29.46   26.39
     6  -    10      .00     .00    6.25   27.32   23.07   18.70   15.62   14.32
    11  -    25      .00     .00   12.65   12.24   10.73    8.71    7.28    6.88
    26  -    50      .00     .00    7.05    6.76    5.46    4.29    3.59    3.46
    51  -   100      .00    5.30    4.87    3.85    3.00    2.33    2.01    1.95
   101  -   500      .00    1.60    1.20    1.00     .76     .61     .56     .55
   501  -  1000      .00     .34     .52     .42     .33     .28     .26     .26
  1001  -  5000      .08     .17     .15     .11     .09     .08     .08     .08
  5001  - 10000      .02     .06     .05     .05     .04     .04     .04     .04
 10001  - 25000      .03     .03     .03     .02     .02     .02     .02     .02
 25001  - 50000      .01     .01     .01     .01     .01     .01     .01     .01



of this database, the statistical agency could release the information for a total 
of 1,506 cells associated with all the 3-D cross sections in the database (without 
confidentiality concern) and choose not to publish the rest of the cells in the 
database. By adopting such a policy, however, the statistical agency would have 
published only 4% (1,506 out of 33,860 cells) of data collected by the survey 
operation. The challenge facing the statistical agency is how to balance the data 
needs for diverse user groups. 
Part B of Table 9 shows the average change in the cell value within various
cross sections of the CTA adjusted 8-D data, using the same size categories as
in the first part of the table. Part C of Table 9 shows the corresponding
percentage changes in cell values in different cross sections by the same size
categories. Because most cells in the lower dimensional cross sections are
relatively large, the resulting percentage adjustments from the CTA procedure
are generally smaller than one percent. Such a small percentage change in cell
values is therefore likely to be statistically insignificant for hypothesis
testing using log-linear model analysis. At the same time, a diverse group of
other data users have access to all other relevant information from various
cells of different dimensionalities resulting from the 8-D database.
To demonstrate the suitability of the CTA protected data for log-linear analysis,
in Appendix B we provide a comparative evaluation of the actual and CTA
protected data on the 3-D section of age, education and annual salary from the
8-D data.

8 Static vs. Dynamic Table Server Design 

In recent years, there has been some discussion on operating SDL protected 
table servers in dynamic mode through interactive web access. The guidance 
matrix concept, in combination with commonly used cell suppression and/or cell 
perturbation methods, offers wide flexibility to statistical agencies in generating 
a wide variety of table structures for public use. Except for one, all tabular data 
protection methods require (a) a number of categorical variables for publication, 
(b) a number of table structures, and (c) dimensionality of the table structure 
for public release to be defined a priori to ensure adequate statistical disclosure 
control. 

For the simplified CTA procedure, however, there is no need to know the 
total number of table structures and table dimensionalities for public release a 
priori in the system design. This allows the table server to function more or less 
in a dynamic mode without concern for statistical disclosure. Whenever dynamic 
table server capability is required through interactive web access, the statistical 
agency has two separate options. In the first option, the statistical agency could 
store in a database only the internal tabular cell values after adjusting internal 
sensitive table cells by plus or minus adjustment factors. The statistical agency 
could then provide user specific table generation capability in real time. In the 
second option, the statistical agency could store in the database management 
system all SDL protected internal cell values along with all possible marginal 
cell values. Users will be allowed to retrieve necessary information by accessing 
the database through a query system. 
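
The first option can be pictured with a short sketch: only CTA-adjusted internal cells are stored, and any requested table is produced on demand by summing over the variables the user leaves out. The function name `build_table` and the tuple-keyed dictionary are assumptions made for illustration, not a description of an actual table server.

```python
from collections import defaultdict

def build_table(internal_cells, keep):
    """Aggregate stored internal cells over the variables not listed in keep.

    internal_cells maps a tuple of category codes (one per variable) to a
    CTA-adjusted count; keep lists the positions of the requested variables.
    """
    table = defaultdict(int)
    for key, value in internal_cells.items():
        table[tuple(key[j] for j in keep)] += value
    return dict(table)

if __name__ == "__main__":
    # Hypothetical adjusted internal cells of a 3-variable database.
    stored = {(0, 0, 0): 3, (0, 1, 0): 0, (0, 0, 1): 2, (1, 0, 1): 5, (1, 1, 1): 4}
    print(build_table(stored, keep=[0]))       # 1-D table on the first variable
    print(build_table(stored, keep=[0, 2]))    # 2-D table on variables 1 and 3
```

Because every requested table is derived from the same adjusted internal cells, all tables served this way are additive and mutually consistent by construction.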

9 Conclusion 

The concept of a guidance matrix gives statistical agencies a flexible option
to simultaneously capture and process multiple multi-dimensional cross sections
containing complex hierarchies in tabular data release. The guidance matrix
tool is equally effective on counts data and on magnitude data and is compatible
with all cell suppression and cell perturbation methods used to protect
sensitive tabular data. Basing table servers on the CTA procedure for SDL
control, in combination with the flexibility of data release offered by the
guidance matrix concept, further enhances table server capabilities.

Log-linear models are association models. Inferences based on any model are 
subject to some uncertainty. The uncertainty bounds for model-based inferences 
are easy to estimate and quantify by performing sensitivity/uncertainty analysis. 
Latin hypercube sampling based experimental design studies are routinely con- 
ducted to simultaneously capture the effects of uncertainties in multiple model 
input parameters on the model inference(s). We look forward to such evaluations 
of log-linear and other relevant models by using CTA protected tabular data. 



Appendix A 

Simplified CTA on 9-D Magnitude Data

126 x 4-D Cross Sections - Cell Count: 60,588

[Histogram of cell size (rows) by absolute change in cell value (columns). Only
the rows below survive in full; the column category bounds and the remaining
rows are not recoverable.]

    Cell Size        < Absolute Change in Cell Value >
  From  -    To

    33  -    64      606     97      0      0      0      0      0      0      0
   129  -   250      941    331    575      0      0      0      0      0      0
   251  -   500     1074     49    809    906      0      0      0      0      0
   501  -  1000     3689     79    239   2267   2242      0      0      0      0
  1001  -  2000     7045    239    326    629   4143    110     11      0      0
  2001  -  4000    12819    405    765   1096   1849    278    108      6      0
  4001  -  8000     5962    358    634   1155   2302    345     99     23      1
  8001  - 16000      959    202    307    488    905    241     79     32      2

84 x 6-D Cross Sections - Cell Count: 161,783

[Table body not recoverable.]

Appendix B1

ACTUAL DATA - LOG LINEAR ANALYSIS
PARTIAL ASSOCIATION STATISTICS USING IMSL ROUTINE "CTASC"
3-D SECTION X ( A=AGE, B=EDUCATION, C=ANNUAL SALARY )

Table 1: C = 1
A (row) by B (column)
             1          2          3          4          5
  1     1738.0     2444.0     1088.0       33.0     3036.0
  2     3051.0     8930.0     5676.0     1245.0     5010.0
  3     1253.0     1907.0      675.0      300.0      769.0

Table 2: C = 2
A (row) by B (column)
             1          2          3          4          5
  1       11.0       20.0       28.0        4.0       30.0
  2      221.0     2020.0     3657.0     2037.0     1694.0
  3      134.0      463.0      563.0      466.0      339.0

Partial Association Statistics
Omitted                   Degrees of               Marginal
Effect      Chi-Square    Freedom       P-value    Zeros
A             25535.30      2.0          .0000       .0
B              9153.81      4.0          .0000       .0
C             13958.71      1.0          .0000       .0
A*B            2584.77      8.0          .0000       .0
A*C            3089.65      2.0          .0000       .0
B*C            4606.03      4.0          .0000       .0
A*B*C            26.51      8.0          .0009       .0

Chi-square statistics for testing that all k and higher interactions are zero.
     Likelihood               Degrees of
k    Ratio         P-Value    Freedom      Pearson      P-Value
1    61349.58       .0000       29.0       73429.15      .0000
2    12701.76       .0000       22.0       12810.05      .0000
3       26.51       .0009        8.0          28.43      .0004

Chi-square statistics for testing that all k-factor interactions are
simultaneously zero.
     Likelihood               Degrees of
k    Ratio         P-Value    Freedom      Pearson      P-Value
1    48647.82       .0000        7.0       60619.14      .0000
2    12675.24       .0000       14.0       12781.62      .0000
3       26.51       .0009        8.0          28.43      .0004

Appendix B2

CTA PROTECTED DATA - LOG LINEAR ANALYSIS
PARTIAL ASSOCIATION STATISTICS USING IMSL ROUTINE "CTASC"
3-D SECTION X ( A=AGE, B=EDUCATION, C=ANNUAL SALARY )

Table 1: C = 1
A (row) by B (column)
             1          2          3          4          5
  1     1745.0     2453.0     1082.0       31.0     3034.0
  2     3048.0     8933.0     5681.0     1240.0     5012.0
  3     1255.0     1901.0      680.0      298.0      763.0

Table 2: C = 2
A (row) by B (column)
             1          2          3          4          5
  1        6.0       17.0       31.0        6.0       29.0
  2      226.0     2016.0     3653.0     2043.0     1691.0
  3      127.0      467.0      562.0      471.0      343.0

Partial Association Statistics
Omitted                   Degrees of               Marginal
Effect      Chi-Square    Freedom       P-value    Zeros
A             25539.01      2.0          .0000       .0
B              9150.00      4.0          .0000       .0
C             13958.07      1.0          .0000       .0
A*B            2572.01      8.0          .0000       .0
A*C            3110.17      2.0          .0000       .0
B*C            4655.57      4.0          .0000       .0
A*B*C            24.14      8.0          .0022       .0

Chi-square statistics for testing that all k and higher interactions are zero.
     Likelihood               Degrees of
k    Ratio         P-Value    Freedom      Pearson      P-Value
1    61423.50       .0000       29.0       73484.04      .0000
2    12776.41       .0000       22.0       12879.93      .0000
3       24.14       .0022        8.0          27.01      .0007

Chi-square statistics for testing that all k-factor interactions are
simultaneously zero.
     Likelihood               Degrees of
k    Ratio         P-Value    Freedom      Pearson      P-Value
1    48647.08       .0000        7.0       60604.11      .0000
2    12752.27       .0000       14.0       12852.92      .0000
3       24.14       .0022        8.0          27.01      .0007
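
The likelihood ratio statistics for k = 1 and k = 2 in the "all k and higher interactions are zero" tables have closed forms, so they can be recomputed from the two 3 x 5 x 2 tables without IMSL. The sketch below is an independent illustration, not the CTASC routine: it assumes that k = 1 corresponds to a model with no effects at all (equal expected counts in every cell) and that k = 2 corresponds to mutual independence of A, B and C. The function names are hypothetical.

```python
from math import log

# counts[c][a][b]: C = annual salary (2 levels), A = age (3), B = education (5)
ACTUAL = [
    [[1738, 2444, 1088, 33, 3036],
     [3051, 8930, 5676, 1245, 5010],
     [1253, 1907, 675, 300, 769]],
    [[11, 20, 28, 4, 30],
     [221, 2020, 3657, 2037, 1694],
     [134, 463, 563, 466, 339]],
]
CTA_PROTECTED = [
    [[1745, 2453, 1082, 31, 3034],
     [3048, 8933, 5681, 1240, 5012],
     [1255, 1901, 680, 298, 763]],
    [[6, 17, 31, 6, 29],
     [226, 2016, 3653, 2043, 1691],
     [127, 467, 562, 471, 343]],
]

def g2(observed, expected):
    """Likelihood ratio chi-square: 2 * sum(obs * ln(obs / exp))."""
    return 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

def lr_tests(table):
    """Return (G2 for k = 1, G2 for k = 2) under the assumptions stated above."""
    cells = [(c, a, b, n) for c, layer in enumerate(table)
             for a, row in enumerate(layer) for b, n in enumerate(row)]
    observed = [cell[3] for cell in cells]
    total = sum(observed)
    # k = 1: no effects at all, every cell has the same expected count.
    uniform = [total / len(cells)] * len(cells)
    # k = 2: main effects only, i.e. expected = N * p_C * p_A * p_B.
    margins = [{}, {}, {}]
    for c, a, b, n in cells:
        for margin, level in zip(margins, (c, a, b)):
            margin[level] = margin.get(level, 0) + n
    independent = [margins[0][c] * margins[1][a] * margins[2][b] / total ** 2
                   for c, a, b, _ in cells]
    return g2(observed, uniform), g2(observed, independent)

if __name__ == "__main__":
    print("actual:       ", lr_tests(ACTUAL))
    print("CTA protected:", lr_tests(CTA_PROTECTED))
```

Under these assumptions the values for the actual data should come out close to the 61349.58 and 12701.76 reported in Appendix B1, and the CTA protected data should differ from them only marginally.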
Abstract. Current network flows heuristics for cell suppression in positive
tables (i.e., cell values are greater than or equal to zero) rely on the
solution of several minimum-cost subproblems. A very efficient heuristic
based on shortest-paths was also developed in the past, but it was only
appropriate for general tables (i.e., cell values can be either positive or
negative), whereas in practice most real tables are positive. The method
presented in this work overcomes the drawbacks of previous approaches: it is
designed for positive tables and only requires the solution of shortest-paths
subproblems. As a result, it is orders of magnitude faster than
previous network flows heuristics. We report extensive computational
experience in the solution of two-dimensional tables, either with or without
hierarchical rows, of up to 250000 and 500000 cells, respectively.

Keywords: statistical disclosure control, cell suppression problem, linear
programming, network optimization, shortest-paths.



1 Introduction 

Cell suppression is a technique widely used by national statistical institutes
for the protection of confidential tabular data. Given a list of cells to be
protected, the purpose of the cell suppression problem (CSP) is to find the
pattern of additional (a.k.a. complementary or secondary) cells to be
suppressed to avoid the disclosure of the sensitive ones. This pattern of
suppressions is found under some criterion such as the minimum number of
suppressions or the minimum value suppressed.

Because of the NP-hardness of CSP [11], most of the former approaches
focused on heuristics for approximate solutions. This work deals with them,
presenting a new one for positive tables (i.e., cell values are greater than or
equal to zero) based on shortest-paths, which turned out to be orders of
magnitude faster than alternative methods. A recent mixed integer linear
programming (MILP) method [5] was able to solve nontrivial CSP instances to
optimality.

* Supported by the EU IST-2000-25069 CASC project and the Spanish MCyT Project
TIC2003-00997. The author thanks Narcís Nabona for providing him with the
implementation of Dijkstra’s shortest-paths algorithm.
The main inconvenience of this approach, from the practitioner's point of view,
is that the solution of very large instances (with possibly millions of cells) can
result in impractical execution times [6]. It is noteworthy that improvements in
new heuristics also benefit the above exact procedure, since they provide it with
a fast, hopefully good, feasible starting point.

Most current heuristic methods for CSP are based on the solution of network
flows subproblems [1]. Other approaches, such as the hypercube method [7],
focus on geometric considerations of the problem. Although very efficient, the
hypercube method provides patterns with a large number of secondary cells or a
large value suppressed (i.e., it suffers from over-suppression). Network flows
heuristics for CSP usually exploit the table information more efficiently and
provide better results.

There is a fairly extensive literature on network flows methods for CSP. For
positive tables, they rely on the formulation of minimum-cost network flows
subproblems [3,4,11]. Such approaches have been successfully applied in practice
[10, 13]. These heuristics require the table structure to be modeled as a network,
which can only be accomplished for two-dimensional tables with, at most, one
hierarchical dimension (either rows or columns). Although minimum-cost network
flows algorithms are fast compared to the equivalent linear programming
formulations [1, Ch. 9-11], for large tables they still require large execution
times. Instead, the approach suggested in [2] formulated shortest-paths
subproblems, a particular case of minimum-cost network flows that can be solved
much more efficiently through specialized algorithms [1, Ch. 4-5]. The main
drawback of that shortest-paths approach was that it could only be applied to
general tables (i.e., cell values can be either positive or negative), which
are the less common case in practice.
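
For readers unfamiliar with the specialized algorithms mentioned above, the following is a generic textbook sketch of Dijkstra's shortest-path algorithm with a binary heap. It is not the implementation used in this work; the function name, the adjacency-list layout and the example graph are assumptions made for illustration.

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances from source in a graph with nonnegative arc costs.

    adj maps a node to a list of (neighbor, cost) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path to u was already found
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

if __name__ == "__main__":
    graph = {"s": [("a", 2), ("b", 5)], "a": [("b", 1), ("t", 6)], "b": [("t", 2)]}
    print(dijkstra(graph, "s"))
```

Each arc is examined a bounded number of times and no general-purpose linear programming machinery is needed, which is the source of the speed advantage over minimum-cost flow subproblems.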

To avoid the above drawbacks of current network flows heuristics (namely, the
limited efficiency of those based on minimum-cost flows subproblems, and the
unsuitability of that based on shortest-paths for positive tables) we present a
new method that sensibly combines and improves ideas of previous approaches
(mainly [2] and [4]). The resulting method applies to positive tables and
formulates shortest-paths subproblems. As shown by the computational results,
it is much faster than previous network-flows heuristics for positive tables.
The new approach has been included in the τ-Argus package [8] in the scope of
the European Union funded “CASC” project IST-2000-25069, for the protection of
two-dimensional tables with at most one hierarchical dimension.

This paper is organized as follows. Section 2 outlines the formulation of CSP. 
Section 3 briefly shows how to model a two-dimensional table with at most one 
hierarchical dimension as a network. Section 4 shows the heuristic introduced in 
this work. Finally, Section 5 reports the computational experience in the solution 
of large instances, showing the efficiency of the method. 

2 Formulation of CSP 

Given a positive table (i.e., a set of cells $a_i \ge 0$, $i = 1 \dots n$,
satisfying some linear relations $Aa = b$), a set $\mathcal{P}$ of $p$ primary
sensitive cells to be protected, and upper and lower protection levels $upl_i$
and $lpl_i$ for each primary cell $i = 1 \dots p$, the purpose of CSP is to find
a set $\mathcal{S}$ of additional secondary cells whose suppression guarantees
that, for each $p \in \mathcal{P}$,

$$\underline{a}_p \le a_p - lpl_p \quad \text{and} \quad \overline{a}_p \ge a_p + upl_p, \tag{1}$$

$\underline{a}_p$ and $\overline{a}_p$ being defined as

$$
\begin{array}{rll}
\underline{a}_p = \min & x_p & \\
\text{s.t.} & Ax = b & \\
 & x_i \ge 0 & i \in \mathcal{P} \cup \mathcal{S} \\
 & x_i = a_i & i \notin \mathcal{P} \cup \mathcal{S}
\end{array}
\qquad \text{and} \qquad
\begin{array}{rll}
\overline{a}_p = \max & x_p & \\
\text{s.t.} & Ax = b & \\
 & x_i \ge 0 & i \in \mathcal{P} \cup \mathcal{S} \\
 & x_i = a_i & i \notin \mathcal{P} \cup \mathcal{S}.
\end{array}
\tag{2}
$$

$\underline{a}_p$ and $\overline{a}_p$ in (2) are the lowest and greatest
possible values that can be deduced for each primary cell from the published
table, once the entries in $\mathcal{P} \cup \mathcal{S}$ have been suppressed.
Imposing (1), the desired level of protection is guaranteed. CSP can thus be
formulated as an optimization problem of minimizing some function that measures
the cost of suppressing additional cells, subject to conditions (1) and (2)
being satisfied for each primary cell.

CSP was first formulated in [11] as a large MILP problem. For each entry $a_i$
a binary variable $y_i$, $i = 1 \dots n$, is considered; $y_i$ is set to 1 if
the cell is suppressed, and to 0 otherwise. For each primary cell
$p \in \mathcal{P}$, two auxiliary vectors $x^{l,p} \in \mathbb{R}^n$ and
$x^{u,p} \in \mathbb{R}^n$ are introduced to impose, respectively, the lower and
upper protection requirements of (1) and (2). These vectors represent cell
deviations (positive or negative) from the original $a_i$ values. The resulting
model is

$$
\begin{array}{rll}
\min & \displaystyle\sum_{i=1}^{n} w_i y_i & \\
\text{s.t.} & A x^{l,p} = 0 & \\
 & -a_i y_i \le x_i^{l,p} \le M y_i & i = 1 \dots n \\
 & x_p^{l,p} \le -lpl_p & \\
 & A x^{u,p} = 0 & \\
 & -a_i y_i \le x_i^{u,p} \le M y_i & i = 1 \dots n \\
 & x_p^{u,p} \ge upl_p & \\
 & y_i \in \{0, 1\} & i = 1 \dots n,
\end{array}
\tag{3}
$$

where the constraints on $x^{l,p}$ and $x^{u,p}$ are imposed for each
$p \in \mathcal{P}$, $w_i$ being the information loss associated to cell $a_i$.
In practice the weights are usually set to $w_i = a_i$ (minimize the overall
suppressed value) or $w_i = 1$ (minimize the overall number of suppressed
cells). The two-sided inequality constraints of (3) impose the bounds of
$x_i^{l,p}$ and $x_i^{u,p}$ when $y_i = 1$ ($M$ being a large value), and
prevent deviations in nonsuppressed cells (i.e., $y_i = 0$). Clearly,
the constraints of (3) guarantee that the solutions of the linear programs (2)
will satisfy (1). (3) gives rise to a very large MILP problem even for tables of
moderate sizes and number of primary cells. If matrix $A$ can be modeled as a
network, we can find a feasible non-optimal (but hopefully good) solution to (3)
through a network-flows heuristic.
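
As a small worked illustration of conditions (1) and (2), consider a hypothetical $(2+1) \times (2+1)$ table (the numbers and protection levels below are chosen for illustration and are not taken from the computational experiments) in which the four internal cells are suppressed and all marginals are published, with $a_{11}$ as the only primary cell:

```latex
% Internal cells a_{11}=5, a_{12}=6, a_{21}=10, a_{22}=15; published marginals:
% row totals 11 and 25, column totals 15 and 21, grand total 36.
% Every solution of Ax = b that agrees with the published marginals has the form
\[
x_{11} = 5 + t, \qquad x_{12} = 6 - t, \qquad x_{21} = 10 - t, \qquad x_{22} = 15 + t .
\]
% Nonnegativity of the four suppressed cells requires
\[
-5 \le t \le 6 ,
\]
% so the two linear programs in (2) give
\[
\underline{a}_{11} = \min x_{11} = 0 , \qquad \overline{a}_{11} = \max x_{11} = 11 .
\]
% With assumed protection levels lpl_{11} = upl_{11} = 3, condition (1) reads
\[
0 \le 5 - 3 = 2 \qquad \text{and} \qquad 11 \ge 5 + 3 = 8 ,
\]
% so this suppression pattern protects a_{11}.
```

The single free parameter $t$ corresponds to sending flow around the one cycle formed by the four suppressed cells, which is the intuition exploited by network-flows heuristics.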

3 Modelling Tables as Networks 

It is well known that the linear relations of a two-dimensional (r + 1) × (c + 1)
table defined by the system Ax = b can be modeled as the network of Figure 1.
Arcs are associated to cells and nodes to constraints; row r + 1 and column c + 1
correspond to marginals.




Fig. 1. Network representation of a (r + 1) × (c + 1) table
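
One way to build such a network in code is sketched below. The arc orientation is an assumption (one common convention for bipartite table networks), not necessarily the one drawn in Figure 1, and the function names are hypothetical: internal cells run from row nodes to column nodes, marginal cells run back from column nodes to row nodes, and the grand total runs from the marginal row node to the marginal column node. With this orientation, the additivity of the table is exactly flow conservation at every node.

```python
def table_network(table):
    """Return arcs (tail, head, flow) for an (r+1) x (c+1) table with marginals.

    The last row and the last column of table hold the marginal totals.
    """
    r, c = len(table) - 1, len(table[0]) - 1
    arcs = []
    for i in range(r + 1):
        for j in range(c + 1):
            internal = i < r and j < c
            grand_total = i == r and j == c
            if internal or grand_total:
                arcs.append((("row", i), ("col", j), table[i][j]))
            else:  # row marginal or column marginal
                arcs.append((("col", j), ("row", i), table[i][j]))
    return arcs

def is_conserved(arcs):
    """True if inflow equals outflow at every node."""
    balance = {}
    for tail, head, flow in arcs:
        balance[tail] = balance.get(tail, 0) - flow
        balance[head] = balance.get(head, 0) + flow
    return all(b == 0 for b in balance.values())

if __name__ == "__main__":
    t1 = [[5, 6, 11], [10, 15, 25], [15, 21, 36]]
    arcs = table_network(t1)
    print(len(arcs), is_conserved(arcs))
```

The sketch yields one arc per cell, (r + 1)(c + 1) in total, and one node per row and column, r + c + 2 in total.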

Two-dimensional tables with one hierarchical dimension can also be modeled
through a network [10]. Without loss of generality we will assume that hierarchies
appear in rows. Before presenting the overall algorithm for computing the
network, we first illustrate it through the example of Figure 2. Row R2 of table
T1 has a hierarchical structure: R2 = R21 + R22. The decomposition of R2 is
detailed in T2. And row R21 of table T2 is also hierarchical; T3 shows its structure.
Although in the example all the tables have the same number of rows, this
is not required in general. However, the number of columns must be the same
for all the tables; otherwise, we would not preserve the hierarchical structure in
only one dimension. Clearly every subtable can be modeled through a network
similar to that of Figure 1.

The hierarchical structure tree of the example is shown in Figure 3. That tree
has three levels, and one table per level. In general, we can have hierarchical
tables of any number of levels, and any number of tables per level (i.e., any
number of hierarchical rows for each table).

The procedure first computes a list of all the tables in the hierarchical structure
tree, using a breadth-first order. If we denote it by BFL, we have for the
example BFL = {T1, T2, T3}. Next, the first table of the list is extracted (T1),
which is always the top table of the tree (i.e., that with the highest level of
aggregation). An initial network is created for T1. This is the hierarchical network,
which will be successively updated. It is depicted in network a) of Figure 4. For the
rest of tables in BFL, we start an iterative procedure that extracts one table
per iteration, creates its network, and updates the hierarchical network. When

            T1                        T2                        T3
        C1   C2   C3              C1   C2   C3              C1   C2   C3
  R1     5    6   11      R21      8   10   18      R211     6    6   12
  R2    10   15   25      R22      2    5    7      R212     2    4    6
  R3    15   21   36      R2      10   15   25      R21      8   10   18

Fig. 2. Two-dimensional table with hierarchical rows made up of three $(2+1) \times (2+1)$
subtables, T1, T2 and T3



T1
 |  R2
T2
 |  R21
T3

Fig. 3. Hierarchical structure tree of the example of Figure 2



BFL is empty, the hierarchical network models the hierarchical table. In the
example, the first iteration extracts T2, and creates the network b) of Figure 4.
To update the hierarchical network we only need the parent table of T2 (i.e., T1)
and the row in the parent table expanded by T2 (i.e., R2). The node associated
to row R2 of table T1 in the hierarchical network will be the linking node for
the insertion of the network of T2. We now have (1) to remove node R2 and the
grand total arc in the network of T2; and (2) to replace node R2 of T1 in the
hierarchical network by the previously updated network of T2. This insertion
procedure is shown in network c) of Figure 4. Table T3 would be inserted in a
similar way in the last iteration. The overall algorithm is shown in Figure 5.
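
The traversal part of this procedure can be sketched as follows. The sketch only computes the breadth-first list and, for every table after the first, the parent table and the row it expands; the network merge itself is not reproduced. The dictionary encoding of the tree and the function names are assumptions made for the illustration.

```python
from collections import deque


def breadth_first_list(tree, root):
    """tree maps a table name to {hierarchical row: child table}."""
    order, queue = [], deque([root])
    while queue:
        table = queue.popleft()
        order.append(table)
        queue.extend(tree.get(table, {}).values())
    return order


def insertion_plan(tree, root):
    """For each non-root table, in BFL order: (table, parent table, expanded row)."""
    parent_of = {child: (parent, row)
                 for parent, rows in tree.items()
                 for row, child in rows.items()}
    bfl = breadth_first_list(tree, root)
    return [(table, *parent_of[table]) for table in bfl[1:]]


# Tree of the example: R2 of T1 is detailed in T2, R21 of T2 is detailed in T3.
tree = {"T1": {"R2": "T2"}, "T2": {"R21": "T3"}}
print(breadth_first_list(tree, "T1"))
print(insertion_plan(tree, "T1"))
```

Processing the tables in breadth-first order guarantees that the parent of a table is already part of the hierarchical network when the table is inserted.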

More complex tables (e.g., two-dimensional tables with hierarchies in both
dimensions) can be modeled as a network with additional side constraints. Although
the structure of such side constraints is simple ($x_i = x_j$, i.e., the value of
cell $i$ is equal to the value of cell $j$), we can have a fairly large number of them.
In that case, specialized algorithms for network flows with side constraints can
show a performance similar to that of general state-of-the-art dual simplex
implementations. Network-flows heuristics for CSP will, in general, not show a good
performance in these situations.

4 The Shortest-Paths Heuristic 

The heuristic sensibly combines and improves the approaches suggested in [2]
and [4]. We will use the notation $a = (s, t)$ for an arc $a$ with source and target
nodes $s$ and $t$. After modelling the table as a network, for each cell $a_i$ two
arcs $x_i^+ = (s, t)$ and $x_i^- = (t, s)$ are defined, which are respectively related to
increments and decrements of the cell value. Arcs $x_i^+$ are clockwise, and are those

Fig. 4. First iteration of the procedure for computing the network associated to the
example of Figure 2. a) Network of table T1. b) Network of table T2. c) Network after
including T2 in T1. Nodes of T2 with the same name in T1 carry a distinguishing mark



Algorithm Compute_hierarchical_network(tree of 2D tables)
  Compute BFL, breadth-first list of tables in tree;
  T0 = extract_first(BFL);
  N = network2D(T0); // N is the hierarchical network
  while BFL is not empty do
    T = extract_first(BFL);
    NT = network2D(T);
    P = parent_table_of(T);
    r = row_in_parent_table(P, T);
    Update NT removing grand total arc and node of row r;
    Update N replacing node of row r of P by NT;
  end_while
End_algorithm



Fig. 5. Algorithm for computing the network of a two-dimensional table with at most 
one hierarchical dimension 



that appear in Figures 1 and 4. Arcs $x_i^-$ would be obtained in those figures
by reversing the direction of the arrows.

Figure 6 shows the main steps of the heuristic. Through the process, it updates
the set of secondary cells $S$, and two vectors Clpl and Cupl with the

Algorithm Shortest-paths heuristic for CSP (Table, P, upl, lpl)
 1  S = ∅; Clpl_i = 0, Cupl_i = 0, i ∈ P;
 2  for_each p ∈ P do
 3    Find source and target nodes of primary arc x_p^+ = (s, t);
 4    for_each type of protection level * ∈ {lpl, upl} do
 5      TT = ∅;
 6      while (C*_p < *_p) do
 7        Set arc costs;
 8        Compute the shortest-path SP from t to s;
 9        T = {cells associated to arcs ∈ SP};
10        Update Clpl_i and Cupl_i, i ∈ (P ∩ T) ∪ {p};
11        S := S ∪ T \ P;
12        TT := TT ∪ T;
13      end_while
14    end_for_each
15  end_for_each
End_algorithm



Fig. 6. Shortest-paths heuristic for CSP in positive tables 



current lower and upper protection of all the primaries. The heuristic performs
one major iteration for each primary cell $p \in \mathcal{P}$ (line 2 of Figure 6), and, unlike
other approaches, deals separately with the lower and upper protections (line 4).
If not already done by previous primaries, $p$ is protected through one or possibly
several minor iterations (lines 6-13). At each minor iteration the arc costs of the
network are set. Arcs related to cells that cannot be used are assigned a very
large cost. This is the only information to be updated for the network, unlike
previous approaches based on minimum-cost network flows problems, which also
modified node injections and arc bounds. A shortest-path from $t$ to $s$ is computed,
where $x_p^+ = (s, t)$. The set $S$ of secondary cells is updated with the cells
associated to arcs in the shortest-path (line 11). To avoid the solution of unnecessary
shortest-path subproblems for next primaries, we update not only the
protection levels of $p$, but also of all the primary cells in the shortest-path (line
10). This is a significant improvement compared to other methods. If several
shortest-paths problems are needed for $p$, cells in previously computed shortest-paths
for this primary must not be used (otherwise we cannot guarantee the
protection of the cell). To this end, TT in Figure 6 maintains the list of cells
already suppressed for the protection of $p$.
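
One minor iteration (lines 7-9 of Figure 6) can be sketched on a small two-dimensional table as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated in the paper: the arc orientation is an assumed one that satisfies flow conservation, the arc cost is simply the cell value instead of the stratified costs discussed in the text, plain rather than bidirectional Dijkstra is used, and all function names are hypothetical.

```python
import heapq

BIG = 10**9  # stands for the "very large cost" of arcs that must not be used


def build_arcs(table, primary):
    """Return arcs (tail, head, cost, (cell, sign)) and the pair (t, s) of the primary."""
    r, c = len(table) - 1, len(table[0]) - 1
    arcs, endpoints = [], None
    for i in range(r + 1):
        for j in range(c + 1):
            row, col = f"R{i + 1}", f"C{j + 1}"
            # Assumed orientation of x_i^+: internal cells and grand total go
            # row -> column, marginal cells go column -> row.
            s, t = (row, col) if (i < r) == (j < c) else (col, row)
            if (i, j) == primary:
                endpoints = (t, s)
                arcs.append((t, s, BIG, ((i, j), "-")))  # forbid the trivial path x_p^-
            else:
                cost = table[i][j]  # illustrative cost only
                arcs.append((s, t, cost, ((i, j), "+")))
                arcs.append((t, s, cost, ((i, j), "-")))
    return arcs, endpoints


def shortest_path(arcs, source, target):
    """Plain Dijkstra; returns the labels of the arcs on a cheapest path.
    Assumes the target is reachable, which holds for a full table network."""
    adjacency = {}
    for tail, head, cost, label in arcs:
        adjacency.setdefault(tail, []).append((head, cost, label))
    dist, prev, heap = {source: 0}, {}, [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for head, cost, label in adjacency.get(node, []):
            if d + cost < dist.get(head, float("inf")):
                dist[head], prev[head] = d + cost, (node, label)
                heapq.heappush(heap, (d + cost, head))
    path, node = [], target
    while node != source:
        node, label = prev[node]
        path.append(label)
    return path[::-1]


table = [[5, 6, 11],
         [10, 15, 25],
         [15, 21, 36]]
arcs, (t, s) = build_arcs(table, primary=(0, 0))
sp = shortest_path(arcs, t, s)
print(sp)  # [((1, 0), '-'), ((1, 1), '+'), ((0, 1), '-')]
```

The cells on the returned path, together with the primary cell, close a cycle in the network, which is why they are the candidates for secondary suppression.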

We next discuss some of the relevant points of the heuristic. 

- Protection Provided by the Shortest-Path. The shortest-path SP from
  $t$ to $s$ is a list of $l$ arcs $x^*_{i_1}, \dots, x^*_{i_l}$, $*$ being $+$ or $-$ depending on the
  arc orientation, such that $x^*_{i_1} = (t, t_{i_1})$, $x^*_{i_l} = (s_{i_l}, s)$, and
  $x^*_{i_j} = (t_{i_{j-1}}, t_{i_j})$ for all $j = 2, \dots, l-1$. $T = \{i_1, \dots, i_l\}$ is the set of
  cells associated to the arcs in the shortest-path. Defining

  \[ \gamma = \min\{a_{i_j} : i_j \in T\}, \tag{4} \]

  then we can send a flow $\gamma$ through the shortest-path in either direction. That
  means we can increase or decrease $a_p$ by $\gamma$ without affecting the feasibility of
  the table. If $\gamma \ge \max\{lpl_p, upl_p\}$, it follows from (2) that this cell is protected
  by this shortest-path. This is basically the approach used in [4].

  However, our heuristic makes better use of the information provided by the
  shortest-path. It separately computes

  \[ \gamma^+ = \min\{a_p, a_{i_j} : x^+_{i_j} \in SP\}, \qquad \gamma^- = \min\{a_{i_j} : x^-_{i_j} \in SP\}. \tag{5} \]

  If there is no arc $x^-_{i_j}$ in SP, then $\gamma^- = \infty$. $\gamma^+$ gives the amount cell $p$ can be
  decreased without obtaining a negative cell. It is thus the lower protection of
  $p$ provided by this shortest-path. Analogously, the upper protection provided
  is $\gamma^-$. This makes it possible to update the lower and upper levels separately
  and with different protections. One immediate benefit of this procedure is that the
  heuristic can deal with upper protections greater than the cell value (i.e.,
  $upl_p > a_p$). Such large protections are often used for very small cell values.
  For instance, if only arcs $x^+_{i_j}$ appear in SP, it is possible to infinitely increase
  the value of cell $p$ without compromising the feasibility of the table. Indeed,
  in this case the upper protection level provided by the heuristic is $\gamma^- = \infty$.
  This cannot be done with (4). Current protection levels Clpl and Cupl of $p$
  and of the primary cells in the shortest-path are updated using (5) (line 10 of Figure 6).

- Arc Costs. The behaviour of the heuristic is governed by the costs of arcs
  $x_i^+$ and $x_i^-$ associated to cells $a_i$. Arcs not allowed in the shortest-path are
  assigned a very large cost. For instance, this is done for arc $x_p^-$ to avoid
  a trivial shortest-path from $t$ to $s$. As suggested in [4], costs are chosen to
  force the selection of: first, cells $i \in \mathcal{P} \cup S$ with $a_i \ge *_p$ ($* = lpl$ or $* = upl$,
  following the notation of Figure 6); second, cells $i \notin \mathcal{P} \cup S$ with $a_i \ge *_p$;
  third, cells $i \in \mathcal{P} \cup S$ with $a_i < *_p$; and, finally, cells $i \notin \mathcal{P} \cup S$ with $a_i < *_p$.
  This cost stratification attempts to balance the number of new secondary
  suppressions and shortest-paths subproblems to be solved. Clearly, for each
  of the above four categories, cells with the lowest $a_i$ values are preferred.
  The particular costs used by the heuristic can be computed in a single loop
  over the $n$ cells; those suggested in [4] required two such loops.
- Shortest-Path Solver. Shortest-paths subproblems were solved through an
  efficient implementation of Dijkstra's algorithm [1, Ch. 4]. Since we are
  interested in the shortest-path to a single destination, a bidirectional version
  was used. In practice, this can be considered the most efficient algorithm for
  this kind of problem. As shown in the computational results of Section 5,
  it is orders of magnitude faster than minimum-cost network flows codes for
  large instances.

- Lower Bounding Procedure. We also included an improved version of
  the lower bounding procedure introduced in [11]. It is used to obtain a lower
  bound on the optimal solution. The relative gap between this lower bound
  and the solution provided by the heuristic can be used as an approximate
  measure of how far we are from the optimum. Unlike that of [11], our procedure
  includes, for every primary cell $p$, two constraints that force the value
  of the other primary and secondary cells in the same row and column to
  be greater than $\max\{lpl_p, upl_p\}$ (otherwise, $p$ is not protected). This new
  formulation provides slightly better lower bounds. The procedure, originally
  developed for two-dimensional tables, was extended to deal with hierarchical
  ones.

Fig. 7. CPU time vs. number of cells for two-dimensional instances
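
Returning to the protection levels of (4) and (5), the following sketch computes them for a given shortest-path. The representation of the path as a list of (cell value, orientation) pairs and the function name are assumptions made for the illustration.

```python
import math


def protection_from_path(a_p, path):
    """path: list of (cell value a_i, orientation) with orientation '+' or '-'.

    Returns (gamma_plus, gamma_minus, gamma): the lower and upper protection
    of (5), and the single value of (4).
    """
    gamma_plus = min([a_p] + [a for a, sign in path if sign == "+"])
    gamma_minus = min([a for a, sign in path if sign == "-"], default=math.inf)
    gamma = min(a for a, _ in path)
    return gamma_plus, gamma_minus, gamma


# Primary cell of value 5 protected by a path through cells of value 10, 15 and 6.
print(protection_from_path(5, [(10, "-"), (15, "+"), (6, "-")]))  # (5, 6, 6)
# A path made only of '+' arcs gives an unbounded upper protection.
print(protection_from_path(5, [(15, "+"), (7, "+")]))  # (5, inf, 7)
```

In the second call (4) would report a protection of 7 in both directions, whereas (5) distinguishes a lower protection of 5 from an unbounded upper protection.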



5 Computational Results 

The heuristic of Section 4 has been implemented in C. It is currently included
in the τ-Argus package [8] for tabular data protection. For testing purposes,
two instance generators for two-dimensional tables were developed, similar to
that of [11] (see [3] for details). A generator for hierarchical tables was also
designed and implemented. They can be obtained from the author on request. We
produced 54 two-dimensional instances, ranging from 62500 to 562500 cells, and
with $p \in \{1000, 2000, 3000\}$, $p$ being the number of primary cells. Cell weights
were set to $w_i = a_i$ (i.e., cell value). We also generated 72 two-dimensional
hierarchical tables, ranging from 1716 to 246942 cells and from 4 to 185 subtables,
with $p \in \{500, 1000\}$. Cell weights were set to $w_i = a_i$ in half of the instances
and $w_i = 1$ in the remaining ones. In all the cases the lower and upper protection
levels were 15% of the cell value. Primary cells were assigned a value much
lower than for the other cells, which is usual in real data. All the executions were
carried out on a standard PC with a 1.8 GHz Pentium-4 processor. Preliminary runs
on real data sets have only been performed for small two-dimensional tables [12].
In those instances, the shortest-paths heuristic was more efficient and provided
better solutions than the hypercube method [7].

The results obtained are summarized in Figures 7-12. Figures 7-8 show,
respectively for the two-dimensional and hierarchical tables, the CPU time in
seconds vs. the number of cells of the table, for the different numbers of primary
cells. Clearly, the CPU time increases with both $p$ and the number of cells.
However, the shortest-path heuristic was able to solve all the instances in a few
seconds.

[Plot: legend entry p = 1000]

Fig. 8. CPU time vs. number of cells for hierarchical instances

[Plot: horizontal axis #cells; series p = 1000, p = 2000, p = 3000]

Fig. 9. CPU ratio between CPLEX and Dijkstra vs. number of cells for two-dimensional
instances

Figures 9-10 show, again for the two-dimensional and hierarchical tables, the
efficiency of the shortest-path heuristic compared to other methods. We applied
the heuristic twice, formulating the subproblems as minimum-cost network flows
(as previous approaches did), and as shortest-paths. The minimum-cost network
flows subproblems were solved with the network simplex option of CPLEX 7.5,
a state-of-the-art implementation. The largest instances were not solved because
CPLEX required an excessive execution time. The vertical axes of the figures
show the ratio of the CPU time of CPLEX 7.5 and the implementation of Dijkstra's
algorithm used in the heuristic. For the two-dimensional tables we plot
separately the instances for the different $p$ values. Similarly, two lines are plotted
in Figure 10, one for each type of weights (tw = 1 and tw = 2 in the figure
correspond to $w_i = a_i$ and $w_i = 1$, respectively). We observe that the time
ratio increases with the table dimension, and is about 1900 for the largest
instance. Therefore, it can be clearly stated that the shortest-paths formulation
is instrumental in the performance of the heuristic.

[Plot: horizontal axis #cells (log scale)]

Fig. 10. CPU ratio between CPLEX and Dijkstra vs. number of cells for hierarchical
instances

[Plot: horizontal axis #cells]

Fig. 11. Gap vs. number of cells for two-dimensional instances

Finally, Figures 11-12 show, respectively for the two-dimensional and hierarchical
tables, the quality of the solution obtained. The vertical axes show the
relative gap $(ws - lb)/ws$, $ws$ being the weight suppressed by the heuristic, and
$lb$ the computed lower bound. Those figures must be interpreted with caution.
At first sight, it could be concluded that the heuristic works much better for
two-dimensional than for hierarchical tables. However, the lower bounding procedure
could be providing better bounds for two-dimensional tables. It is thus
difficult to know which factor (the quality of the heuristic or the quality of the
lower bounding procedure) explains the much larger gap for hierarchical tables.
Both factors likely intervene, and in that case network flows heuristics (either
formulating minimum-cost or shortest paths subproblems) would behave better
for two-dimensional than for hierarchical tables.

[Plot: horizontal axis #cells (log scale)]

Fig. 12. Gap vs. number of cells for hierarchical instances

6 Conclusions 

From our computational experience, it can be concluded that shortest-paths 
heuristics are much more efficient than those based on minimum-cost network 
flows formulations. It was also observed that such heuristics seem to work better 
for two-dimensional than for hierarchical tables. To avoid over-suppression, the 
former would require a clean-up procedure, similar to that of [11]. However, 
unlike previous approaches, and for efficiency reasons, this clean-up procedure 
would only solve shortest-paths subproblems. This is part of the future work to 
be done.

Abstract. Noise addition is a family of methods used in the protection
of the privacy of individual data (microdata) in statistical databases.
This paper is a critical analysis of the security of the methods in that
family.

disclosure control, Statistical disclosure limitation, Data security,
Respondents' privacy.



1 Introduction 

Privacy in statistical databases is about finding tradeoffs to the tension between 
the increasing societal and economical demand for accurate information and the 
legal and ethical obligation to protect the privacy of individuals and enterprises 
which are the source of the statistical data. To put it bluntly, statistical agen- 
cies cannot expect to collect accurate information from individual or corporate 
respondents unless these feel the privacy of their responses is guaranteed; also, 
surveys of web users [6,7,11] show that a majority of these are unwilling to 
provide data to a web site unless they know that privacy protection measures 
are in place. 

To achieve privacy without losing accuracy, statistical disclosure control 
(SDC) methods must be applied to data before they are released [16]. Data 
in statistical databases are of two kinds: tables (aggregate data) and microdata
(individual respondent data). SDC methods for microdata are also known
as masking methods. Examples of masking methods include noise addition, resampling,
blanking, imputation, microaggregation, etc. (see [5] for a survey).



1.1 Contribution and Plan of This Paper 

This paper concentrates on masking methods based on noise addition and an- 
alyzes their security. It will be shown that the distribution of the original data 
can be reconstructed from the masked data, which in some cases may lead to 
disclosure. In fact, another critique to the security of noise addition has recently 
been published for the univariate case ([8]); using a different approach, we show 
that multivariate noise addition is not free from problems either.

Section 2 reviews noise addition methods for privacy protection. Section 3
presents a procedure for reconstructing the distribution of the original
multivariate data set given the masked data set and the noise distribution. Section 4
shows how the previous procedure can be applied to attack most practical
noise addition methods. Empirical results of attacks are presented in Section 5.
Section 6 is a conclusion. The Appendix contains the mathematical derivation
of the reconstruction procedure given in Section 3.



2 Noise Addition Methods for Privacy Protection 

We will sketch in this section the operation of the main additive noise algorithms 
in the literature. For a more comprehensive description and a list of references, 
see the excellent survey [3]. 



2.1 Masking by Uncorrelated Noise Addition 

Masking by additive noise assumes that the vector of observations $x_j$ for the
$j$-th variable of the original data set $X_j$ is replaced by a vector

\[ z_j = x_j + \epsilon_j \tag{1} \]

where $\epsilon_j$ is a vector of normally distributed errors drawn from a random variable
$\varepsilon_j \sim N(0, \sigma^2_{\varepsilon_j})$, such that $Cov(\varepsilon_t, \varepsilon_l) = 0$ for all $t \neq l$ (white noise).

The general assumption in the literature is that the variances of the $\varepsilon_j$ are
proportional to those of the original variables. Thus, if $\sigma_j^2$ is the variance of $X_j$,
then $\sigma^2_{\varepsilon_j} := \alpha \sigma_j^2$.

In the case of a $p$-dimensional data set, simple additive noise masking can be
written in matrix notation as

\[ Z = X + \varepsilon \]

where $X \sim (\mu, \Sigma)$, $\varepsilon \sim N(0, \Sigma_\varepsilon)$ and

\[ \Sigma_\varepsilon = \alpha \cdot \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_p^2), \quad \text{for } \alpha > 0 \]

This method preserves means and covariances, i.e.

\[ E(Z) = E(X) + E(\varepsilon) = E(X) = \mu \]

\[ Cov(Z_j, Z_l) = Cov(X_j, X_l) \quad \forall j \neq l \]

Unfortunately, neither variances nor correlation coefficients are preserved:

\[ V(Z_j) = V(X_j) + \alpha V(X_j) = (1 + \alpha) V(X_j) \]

\[ \rho_{Z_j, Z_l} = \frac{Cov(Z_j, Z_l)}{\sqrt{V(Z_j) V(Z_l)}} = \frac{1}{1 + \alpha}\, \rho_{X_j, X_l} \quad \forall j \neq l \]

2.2 Masking by Correlated Noise Addition

Correlated noise addition also preserves means and additionally allows preservation
of correlation coefficients. The difference with the previous method is that
the covariance matrix of the errors is now proportional to the covariance matrix
of the original data, i.e. $\varepsilon \sim N(0, \Sigma_\varepsilon)$, where $\Sigma_\varepsilon = \alpha\Sigma$.

With this method, we have that the covariance matrix of the masked data is

\[ \Sigma_Z = \Sigma + \alpha\Sigma = (1 + \alpha)\Sigma \tag{2} \]

Preservation of correlation coefficients follows, since

\[ \rho_{Z_j, Z_l} = \frac{1 + \alpha}{1 + \alpha}\, \frac{Cov(X_j, X_l)}{\sqrt{V(X_j) V(X_l)}} = \rho_{X_j, X_l} \]

Regarding variances and covariances, we can see from Equation (2) that masked 
data only provide biased estimates for them. However, it is shown in [10] that 
the covariance matrix of the original data can be consistently estimated from 
the masked data as long as a is known. 

As a summary, masking by correlated noise addition outputs masked data 
with higher analytical validity than masking by uncorrelated noise addition. 
Consistent estimators for several important statistics can be obtained as long as 
a is revealed to the data user. However, simple noise addition as discussed in 
this section and in the previous one is seldom used because of the very low level 
of protection it provides [14, 15]. 
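
The effect of both kinds of noise on second-order statistics can be checked numerically at the level of covariance matrices. The sketch below is only an illustration of the formulas of Sections 2.1 and 2.2; the example matrix, the value of $\alpha$ and the function names are assumptions.

```python
import math


def masked_covariance(sigma, alpha, correlated):
    """Covariance of Z = X + noise, for noise covariance alpha*Sigma (correlated)
    or alpha*diag(Sigma) (uncorrelated)."""
    p = len(sigma)
    return [[sigma[j][l] + (alpha * sigma[j][l] if correlated or j == l else 0.0)
             for l in range(p)] for j in range(p)]


def correlation(cov, j, l):
    return cov[j][l] / math.sqrt(cov[j][j] * cov[l][l])


sigma = [[4.0, 1.2],
         [1.2, 1.0]]  # original correlation: 1.2 / (2 * 1) = 0.6
alpha = 0.5

uncorrelated = masked_covariance(sigma, alpha, correlated=False)
correlated = masked_covariance(sigma, alpha, correlated=True)
print(correlation(uncorrelated, 0, 1))  # about 0.6 / (1 + alpha) = 0.4
print(correlation(correlated, 0, 1))    # about 0.6, as in the original data
```

In both cases the variances are inflated by the factor $1 + \alpha$; only the correlated variant inflates the covariances by the same factor, which is what keeps the correlation coefficients unchanged.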

2.3 Masking by Noise Addition and Linear Transformations 

In [9], a method is proposed that ensures by additional transformations that the
sample covariance matrix of the masked variables is an unbiased estimator for
the covariance matrix of the original variables. The idea is to use simple additive
noise on the $p$ original variables to obtain overlayed variables

\[ Z_j = X_j + \varepsilon_j, \quad \text{for } j = 1, \dots, p \]

As in the previous section on correlated masking, the covariances of the errors $\varepsilon_j$
are taken proportional to those of the original variables. Usually, the distribution
of errors is chosen to be normal or the distribution of the original variables,
although in [12] mixtures of multivariate normal noise are proposed.

In a second step, every overlayed variable $Z_j$ is transformed into a masked
variable $G_j$ as

\[ G_j = cZ_j + d_j \]

In matrix notation, this yields

\[ Z = X + \varepsilon \]

\[ G = cZ + D = c(X + \varepsilon) + D \]

where $X \sim (\mu, \Sigma)$, $\varepsilon \sim (0, \alpha\Sigma)$, $G \sim (\mu, \Sigma)$ and $D$ is a matrix whose $j$-th
column contains the scalar $d_j$ in all rows. Parameters $c$ and $d_j$ are determined
under the restrictions that $E(G_j) = E(X_j)$ and $V(G_j) = V(X_j)$ for $j = 1, \dots, p$.
In fact, the first restriction implies that $d_j = (1 - c)E(X_j)$, so that the linear
transformations depend on a single parameter $c$.
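
The two restrictions can be worked out explicitly. The short derivation below is a sketch that assumes zero-mean noise uncorrelated with $X$, noise covariance $\alpha\Sigma$ as stated above, and $c > 0$; the closed form for $c$ is a consequence of these assumptions rather than a formula quoted from [9].

```latex
\begin{align*}
E(G_j) &= c\,E(Z_j) + d_j = c\,E(X_j) + d_j = E(X_j)
  &&\Longrightarrow\quad d_j = (1 - c)\,E(X_j), \\
V(G_j) &= c^2\,V(Z_j) = c^2 (1 + \alpha)\,V(X_j) = V(X_j)
  &&\Longrightarrow\quad c = \frac{1}{\sqrt{1 + \alpha}}.
\end{align*}
```

Under these assumptions, revealing $c$ is therefore equivalent to revealing $\alpha = c^{-2} - 1$.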

Due to the restrictions used to determine c, this method preserves expected
values and covariances of the original variables and is quite good in terms of an- 
alytical validity. Regarding analysis of regression estimates in subpopulations, it 
is shown in [10] that (masked) sample means and covariances are asymptotically 
biased estimates of the corresponding statistics on the original subpopulations. 
The magnitude of the bias depends on the parameter c, so that estimates can 
be adjusted by the data user as long as c is revealed to her. 

The most prominent shortcomings of this method are that it does not pre- 
serve the univariate distributions of the original data and that it cannot be 
applied to discrete variables due to the structure of the transformations. 



2.4 Masking by Noise Addition and Nonlinear Transformations 

An algorithm combining simple additive noise and nonlinear transformation is 
proposed in [13]. The advantages of this proposal are that it can be applied to 
discrete variables and that univariate distributions are preserved. 

The method consists of several steps: 

1. Calculating the empirical distribution function for every original variable; 

2. Smoothing the empirical distribution function; 

3. Converting the smoothed empirical distribution function into a uniform ran- 
dom variable and this into a standard normal random variable; 

4. Adding noise to the standard normal variable; 

5. Back-transforming to values of the distribution function; 

6. Back-transforming to the original scale. 

In the European project CASC (IST-2000-25069), the practicality and usabil- 
ity of this algorithm was assessed. Unfortunately, the internal CASC report [4] 
concluded that 

All in all, the results indicate that an algorithm as complex as the one 
proposed by Sullivan can only be applied by experts. Every application 
is very time-consuming and requires expert knowledge on the data and 
the algorithm. 



2.5 Section Summary 

Thus, in practice, only simple noise addition or noise addition with linear trans- 
formation are used. When using linear transformations, the parameter c de- 
termining the transformations is revealed to the data user to allow for bias 
adjustment in the case of subpopulations.

3 Reconstructing the Original Distribution
from Noise-Added Multivariate Data

In [1], an algorithm for reconstructing the original distribution of univariate 
data masked by noise addition is presented. In [2], a reconstruction algorithm 
with the same purpose is proposed which not only converges but does so to 
the maximum likelihood estimate of the original distribution. When comparing 
both proposals, it turns out that, even though the algorithm in [2] has nicer 
theoretical properties, the algorithm in [1] lends itself better to a multivariate 
generalization. Such a generalization is presented in this section. 

We consider an original multivariate data set consisting of $n$ records $x_1, x_2,
\dots, x_n$ with $p$ dimensions each ($x_i = (x_i^1, \dots, x_i^p)$). The $n$ original records are
modeled as realizations of $n$ independent identically distributed random variables
$X_1, X_2, \dots, X_n$, each with the same distribution as a random variable $X$. To
hide these records, $n$ independent $p$-variate variables $Y_1, Y_2, \dots, Y_n$ are used,
each with the same distribution as a random variable $Y$. The published data set $Z$
will be $z_1 = x_1 + y_1, z_2 = x_2 + y_2, \dots, z_n = x_n + y_n$.

The purpose of our algorithm is to estimate the density function of the data
set $X$ (namely $f_X(a^1, \dots, a^p)$) from our knowledge of the published data $Z$ and
the density function $f_Y$.

Rather than estimating $f_X$ itself, the algorithm estimates the probability
that $X$ takes values in a certain interval. The technique is a multivariate
generalization of the approach described in [1] for the univariate case ($p = 1$).

We first partition $\mathbb{R}^p$ into $p$-dimensional intervals (the $i$-th dimension is
divided into $k^i$ subintervals) and consider the grid formed by the midpoints of
such intervals. Let $I_{q^1,\dots,q^p}$ be the interval in which $(a^1, \dots, a^p)$ lies, and let
$m(I_{q^1,\dots,q^p})$ be the midpoint of $I_{q^1,\dots,q^p}$. The following approximations will be
used:

- $f_Y(a^1, \dots, a^p)$ will be approximated by $f_Y(m(I_{q^1,\dots,q^p}))$;
- $f_X(a^1, \dots, a^p)$ will be approximated by the average of $f_X$ over the interval
  in which $(a^1, \dots, a^p)$ lies.

Let $N(I_{q^1,\dots,q^p})$ be the number of points in $Z$ that lie in interval $I_{q^1,\dots,q^p}$,
i.e. the number of elements in the set $\{(z_i^1, \dots, z_i^p) : (z_i^1, \dots, z_i^p) \in I_{q^1,\dots,q^p}\}$.

We can now state the reconstruction algorithm as follows (see Appendix for 
details on the mathematical derivation of the algorithm): 

Algorithm 1 (p-dimensional reconstruction algorithm) 

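1. Let $\Pr^0(X \in I_{q^1,\dots,q^p})$ be the prior probability that $X$ takes a value in
$I_{q^1,\dots,q^p}$. For convenience, compute $\Pr^0$ taking as prior distribution for $X$
the $p$-dimensional uniform distribution over a $p$-dimensional interval in $\mathbb{R}^p$.

2. Let $j = 0$ ($j$ is the iteration counter).

3. Repeat

(a) \[
\Pr\nolimits^{j+1}(X \in I_{q^1,\dots,q^p}) = \frac{1}{n} \sum_{s^1=1}^{k^1} \cdots \sum_{s^p=1}^{k^p} N(I_{s^1,\dots,s^p}) \times
\frac{f_Y\big(m(I_{s^1,\dots,s^p}) - m(I_{q^1,\dots,q^p})\big)\, \Pr\nolimits^{j}(X \in I_{q^1,\dots,q^p})}
     {\sum_{t^1=1}^{k^1} \cdots \sum_{t^p=1}^{k^p} f_Y\big(m(I_{s^1,\dots,s^p}) - m(I_{t^1,\dots,t^p})\big)\, \Pr\nolimits^{j}(X \in I_{t^1,\dots,t^p})}
\]

(b) $j := j + 1$

until stopping criterion met.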
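
For intuition, the update of step 3(a) can be sketched for the univariate case ($p = 1$). The sketch assumes zero-mean normal noise with known standard deviation, a fixed number of iterations as stopping criterion, and masked values clamped to the grid; the function names and the sample data are hypothetical, and no claim is made that this matches the implementation used for the experiments of Section 5.

```python
import math


def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def reconstruct(z, lo, hi, k, noise_sd, iterations=20):
    """Estimate Pr(X in I_q) on k equal-width intervals of [lo, hi] from masked values z."""
    width = (hi - lo) / k
    mid = [lo + (q + 0.5) * width for q in range(k)]
    counts = [0] * k  # N(I_s): number of masked points per interval
    for value in z:
        q = min(k - 1, max(0, int((value - lo) / width)))  # clamp to the grid
        counts[q] += 1
    n = len(z)
    prob = [1.0 / k] * k  # uniform prior (step 1)
    for _ in range(iterations):
        new = [0.0] * k
        for s in range(k):
            if counts[s] == 0:
                continue  # empty intervals contribute nothing to the sum
            weights = [normal_pdf(mid[s] - mid[t], noise_sd) * prob[t] for t in range(k)]
            denom = sum(weights)
            for q in range(k):
                new[q] += counts[s] * weights[q] / (n * denom)
        prob = new
    return mid, prob


masked = [0.8, 1.1, 1.9, 2.2, 2.4, 3.1, 6.7, 7.2, 7.9, 8.1, 8.4, 9.0]
midpoints, probabilities = reconstruct(masked, lo=0.0, hi=10.0, k=10, noise_sd=1.0)
print([round(pr, 3) for pr in probabilities])
```

Each iteration redistributes the $n$ masked points over the grid, so the estimated probabilities sum to one after every pass.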



4 Attacking Noise Addition Methods 

As pointed out in Section 2.5 above, the noise addition approaches used in prac- 
tice are limited to correlated noise addition and noise addition combined with 
linear transformations. Our attacks will be targeted at those two approaches. 



4.1 Attacking Correlated Noise Addition 

As noted in Section 2.2, for consistent estimators to be obtained from data
masked with correlated noise addition, the parameter $\alpha$ must be revealed to the
data user. If a user knows $\alpha$, we have from Equation (2) that she can estimate
the covariance matrix $\alpha\Sigma$ of the added noise as $\alpha\hat{\Sigma}$, where

\[ \hat{\Sigma} = (1 + \alpha)^{-1} \hat{\Sigma}_Z \]

and $\hat{\Sigma}_Z$ is the sample covariance matrix estimated from the masked data.

In this way, a user is able to estimate the distribution of the noise $Y$ added
when masking the data as a $N(0, \alpha\hat{\Sigma})$ distribution. Using the masked data set
and the estimated noise distribution, the user is in a position to run Algorithm 1
to estimate the distribution of the original data $X$.
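
The first step of this attack, estimating the noise covariance from the masked data and the published $\alpha$, can be sketched as follows. The toy records and the function names are assumptions made for the illustration.

```python
def sample_covariance(records):
    """Unbiased sample covariance matrix of a list of p-dimensional records."""
    n, p = len(records), len(records[0])
    mean = [sum(r[j] for r in records) / n for j in range(p)]
    return [[sum((r[j] - mean[j]) * (r[l] - mean[l]) for r in records) / (n - 1)
             for l in range(p)] for j in range(p)]


def estimated_noise_covariance(masked, alpha):
    """alpha * Sigma_hat, with Sigma_hat = Sigma_Z_hat / (1 + alpha)."""
    sigma_z = sample_covariance(masked)
    sigma_hat = [[v / (1 + alpha) for v in row] for row in sigma_z]
    return [[alpha * v for v in row] for row in sigma_hat]


masked = [(1.0, 2.0), (3.0, 6.0), (5.0, 4.0)]  # toy masked records
print(sample_covariance(masked))                 # [[4.0, 2.0], [2.0, 4.0]]
print(estimated_noise_covariance(masked, 1.0))   # [[2.0, 1.0], [1.0, 2.0]]
```

With $\alpha = 1$ half of the masked covariance is attributed to the noise, which is the matrix a user would plug into the noise density $f_Y$ of Algorithm 1.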

Now, if the user is malicious, she is interested in determining intervals $I_{q^1,\dots,q^p}$
which are narrow in one or several dimensions and for which the number of
records in the original data set that fall in the interval is one or close to one. Note
that the number of original records $(x_i^1, \dots, x_i^p)$ that lie in interval $I_{q^1,\dots,q^p}$ can
be estimated using the distribution of the original data as $N \Pr(X \in I_{q^1,\dots,q^p})$,
where $N$ is the total number of individuals in the original data set. Such intervals
disclose the value of one or more attributes for the individual(s) behind the
record(s) lying in the interval.

4.2 Attacking Masking by Noise Addition 
and Linear Transformation 

It was pointed out in Section 2.3 that the linear transformations used to comple- 
ment correlated noise addition actually depend on a single parameter c, which 
must be revealed to the user if the latter is to adjust for biases in subpopulation 
estimates.

Now, a user can use her knowledge of $c$ to estimate the independent term $d_j$
of each linear transformation as

\[ \hat{d}_j = (1 - c)\hat{E}(G_j) \]

where $\hat{E}(G_j)$ is the sample average of the $j$-th masked variable. Next, the user
can undo the linear transformation to recover an estimate of the $j$-th overlayed
variable $Z_j$:

\[ \hat{Z}_j = c^{-1}(G_j - \hat{d}_j) \]

In this way, the values taken by all overlayed variables can be estimated by the
user. Note that the $Z_j$, for $j = 1$ to $p$, are the result of adding correlated noise
to original variables $X_j$. Now, by the choice of $c$,

\[ \Sigma = \Sigma_G = \Sigma_X = (1 + \alpha)^{-1} \Sigma_Z \]

Therefore, the parameter $\alpha$ can be estimated as

\[ \hat{\alpha} = \frac{1}{p} \sum_{j=1}^{p} \left( \frac{\hat{V}(\hat{Z}_j)}{\hat{V}(G_j)} - 1 \right) \]

where $\hat{V}(\hat{Z}_j)$ and $\hat{V}(G_j)$ are the sample variances of $\hat{Z}_j$ and $G_j$, respectively.

Armed with the knowledge of the masked data and the distribution of the
noise, which is estimated as $Y \sim (0, \hat{\alpha}\hat{\Sigma})$, the user can run Algorithm 1 to
estimate the distribution of the original data. The rest of the attack proceeds as in
Section 4.1 above.
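
The estimation steps of this section can be sketched as follows. The masked data are stored column-wise, one list per variable $G_j$; the sample values, the value of $c$ and the function names are assumptions made for the illustration.

```python
def variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def undo_linear_transformation(g_columns, c):
    """Estimate the overlayed variables Z_j from the masked variables G_j and c."""
    z_hat = []
    for column in g_columns:
        d_hat = (1 - c) * sum(column) / len(column)  # estimate of d_j
        z_hat.append([(g - d_hat) / c for g in column])
    return z_hat


def estimate_alpha(g_columns, c):
    z_hat = undo_linear_transformation(g_columns, c)
    ratios = [variance(z) / variance(g) - 1 for z, g in zip(z_hat, g_columns)]
    return sum(ratios) / len(ratios)


g_columns = [[1.0, 2.0, 4.0], [10.0, 14.0, 12.0]]  # two masked variables
alpha_hat = estimate_alpha(g_columns, c=0.5)
print(alpha_hat)  # approximately 3.0
```

Since the sample variance of $\hat{Z}_j$ is exactly $c^{-2}$ times that of $G_j$, the estimate in this sketch reduces to $c^{-2} - 1$ whatever the data, which makes plain that publishing $c$ amounts to publishing $\alpha$.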



5 Empirical Results 

Consider an original data set of n = 5000 bivariate records xi,X 2 ,--- ,a; 5 oooj 
that is, Xi = (xj,x^) for i = I,-- - ,5000. The first variable is the annual 
income of the individual to which the record corresponds. The second variable 
X^ is the amount of the mortgage borrowed by that individual. Figure 1 depicts 
the histogram of the discretized original data set: = 49 are the number 

of intervals considered for each dimension. The histogram clearly shows that 
25% of individuals (those in the lower peak of the histogram) have mortgages 
substantially higher and incomes substantially lower than the remaining 75% of 
the individuals. 

Figure 2 shows the histogram of a masked data set $z_i = (z_i^1, z_i^2)$, for i =
1, ..., 5000, obtained after adding correlated noise with $\alpha$ = 1.167 to the original
data set. It can be seen that both peaks in the original data set collapse into a
single peak in the masked data set, so that the minority with high mortgages
and low income is no longer visible. Thus, publication of the masked data set
appears to protect the privacy of the individuals in that minority.
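As an illustration of the masking step, the following sketch adds correlated noise to bivariate records, that is, noise whose covariance matrix is alpha times the sample covariance matrix of the data. It is only a sketch under stated assumptions: the helper names are hypothetical, the noise is assumed Gaussian, and the 2x2 Cholesky factor is written out by hand to avoid external libraries.

```python
import math
import random


def covariance_2d(records):
    """Sample covariance entries (s11, s12, s22) of a list of (x1, x2) pairs."""
    n = len(records)
    m1 = sum(r[0] for r in records) / n
    m2 = sum(r[1] for r in records) / n
    s11 = sum((r[0] - m1) ** 2 for r in records) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in records) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in records) / (n - 1)
    return s11, s12, s22


def add_correlated_noise(records, alpha, rng):
    """Return z_i = x_i + e_i with Cov(e) = alpha * Cov(x) (Gaussian noise assumed)."""
    s11, s12, s22 = covariance_2d(records)
    # Lower-triangular Cholesky factor of alpha * Sigma_X for the 2x2 case.
    l11 = math.sqrt(alpha * s11)
    l21 = alpha * s12 / l11
    l22 = math.sqrt(alpha * s22 - l21 ** 2)
    masked = []
    for x1, x2 in records:
        u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1)
        masked.append((x1 + l11 * u1, x2 + l21 * u1 + l22 * u2))
    return masked
```

With this choice of noise, the covariance matrix of the masked data is approximately (1 + alpha) times that of the original data, which is the relation exploited by the attack.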

The histogram in Figure 3 has been reconstructed using Algorithm 1 on the
published masked data set. The minority with low income and high mortgage is
clearly detected by the reconstruction algorithm, as can be seen from the figure.
Thus, individuals in this minority are more identifiable than it would seem when
merely looking at the published data set.

Fig. 1. Histogram of the original data set

It may be argued that discovering that there is a 25% minority with low 
income and high mortgage in a population of 5000 individuals does not lead to 
disclosure of any particular individual. This may be true but, if more variables 
are available on the individuals in the data set, a reconstruction with higher p is 
feasible, which would lead to a segmentation of that minority and eventually to 
very small minorities (with a few or even a single individual). Thus, the higher 
the dimensionality p of the reconstruction, the more likely is disclosure. 

6 Conclusions and Future Research 

We have shown that, for noise addition methods used in practice, it is possible for
a user of the masked data to estimate the distribution of the original data. This
is so because the masking parameters must be published in order for estimates
obtained from the masked data to be adjusted for consistency and unbiasedness.
Estimation of the distribution of original data can lead to disclosure, as shown
in the section on empirical results.

Fig. 2. Histogram of the data set masked by correlated noise addition

The main consequence of this work is that, as currently used, masking by 
noise addition may fail to adequately protect the privacy of individuals behind 
a microdata set. 

Future research will be directed to carrying out experiments with real data 
sets and higher dimensionality p. A higher p is needed to increase the likelihood 
of partial/total reconstruction of individual original records. Since the compu- 
tational requirements increase with p, substantial computing resources will be 
necessary for this empirical work. 
Appendix. Multivariate Generalization
of the Agrawal-Srikant Reconstruction 



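The notation introduced in Section 3 will be used in this Appendix. Let the value
of $X_i + Y_i$ be $x_i + y_i = z_i = (z_i^1, \ldots, z_i^p)$. The sum $z_i$ is assumed to be known,
but $x_i$ and $y_i$ are unknown. If knowledge of the densities $f_X$ and $f_Y$ is assumed,
Bayes' rule can be used to estimate the posterior distribution function $F'_{X_1}$
for $X_1$ given that $X_1 + Y_1 = z_1 = (z_1^1, \ldots, z_1^p)$:

\[
\begin{aligned}
F'_{X_1}(a^1,\ldots,a^p)
&= \int_{-\infty}^{a^1}\!\cdots\!\int_{-\infty}^{a^p}
   f_{X_1}\bigl(w^1,\ldots,w^p \mid X_1+Y_1=(z_1^1,\ldots,z_1^p)\bigr)\,dw^1\cdots dw^p \\
&= \frac{\int_{-\infty}^{a^1}\!\cdots\!\int_{-\infty}^{a^p}
   f_{X_1+Y_1}\bigl((z_1^1,\ldots,z_1^p) \mid X_1=(w^1,\ldots,w^p)\bigr)\,
   f_{X_1}(w^1,\ldots,w^p)\,dw^1\cdots dw^p}
   {f_{X_1+Y_1}(z_1^1,\ldots,z_1^p)} \\
&= \frac{\int_{-\infty}^{a^1}\!\cdots\!\int_{-\infty}^{a^p}
   f_{X_1+Y_1}\bigl((z_1^1,\ldots,z_1^p) \mid X_1=(w^1,\ldots,w^p)\bigr)\,
   f_{X_1}(w^1,\ldots,w^p)\,dw^1\cdots dw^p}
   {\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}
   f_{X_1+Y_1}\bigl((z_1^1,\ldots,z_1^p) \mid X_1=(w^1,\ldots,w^p)\bigr)\,
   f_{X_1}(w^1,\ldots,w^p)\,dw^1\cdots dw^p} \\
&= \frac{\int_{-\infty}^{a^1}\!\cdots\!\int_{-\infty}^{a^p}
   f_{Y_1}(z_1^1-w^1,\ldots,z_1^p-w^p)\,
   f_{X_1}(w^1,\ldots,w^p)\,dw^1\cdots dw^p}
   {\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}
   f_{Y_1}(z_1^1-w^1,\ldots,z_1^p-w^p)\,
   f_{X_1}(w^1,\ldots,w^p)\,dw^1\cdots dw^p}
\end{aligned}
\]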

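Since $f_{X_i} = f_X$ and $f_{Y_i} = f_Y$, the posterior distribution function $F'_X$ given
$x_1 + y_1, \ldots, x_n + y_n$ can be estimated as

\[
\begin{aligned}
F'_X(a^1,\ldots,a^p)
&= \frac{1}{n}\sum_{i=1}^{n} F'_{X_i}(a^1,\ldots,a^p) \\
&= \frac{1}{n}\sum_{i=1}^{n}
   \frac{\int_{-\infty}^{a^1}\!\cdots\!\int_{-\infty}^{a^p}
   f_Y(z_i^1-w^1,\ldots,z_i^p-w^p)\,f_X(w^1,\ldots,w^p)\,dw^1\cdots dw^p}
   {\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}
   f_Y(z_i^1-w^1,\ldots,z_i^p-w^p)\,f_X(w^1,\ldots,w^p)\,dw^1\cdots dw^p}
\end{aligned}
\]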

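Now, the posterior density function $f'_X$ can be obtained by differentiating $F'_X$:

\[
f'_X(a^1,\ldots,a^p)
= \frac{1}{n}\sum_{i=1}^{n}
  \frac{f_Y(z_i^1-a^1,\ldots,z_i^p-a^p)\,f_X(a^1,\ldots,a^p)}
  {\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}
  f_Y(z_i^1-w^1,\ldots,z_i^p-w^p)\,f_X(w^1,\ldots,w^p)\,dw^1\cdots dw^p}
\tag{3}
\]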

Equation (3) can be used to iteratively approximate the unknown real density
$f_X$, assuming that $f_Y$ is known. We can take the p-dimensional uniform
distribution as the initial estimate of $f_X$ on the right-hand side of Equation (3);
then we get a better estimate $f'_X$ on the left-hand side. In the next iteration, we
use the latter estimate on the right-hand side, and so on.

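By applying the interval approximations discussed in Section 3 to Equation
(3), we obtain

\[
f'_X(a^1,\ldots,a^p)
= \frac{1}{n}\sum_{i=1}^{n}
  \frac{f_Y\bigl(m(I_{z_i^1,\ldots,z_i^p}) - m(I_{a^1,\ldots,a^p})\bigr)\,f_X(I_{a^1,\ldots,a^p})}
  {\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}
  f_Y\bigl(m(I_{z_i^1,\ldots,z_i^p}) - m(I_{w^1,\ldots,w^p})\bigr)\,f_X(I_{w^1,\ldots,w^p})\,dw^1\cdots dw^p}
\tag{4}
\]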

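Let $V_{q^1,\ldots,q^p}$ be the p-dimensional volume of interval $I_{q^1,\ldots,q^p}$. We can replace
the integral in the denominator of Equation (4) with a sum, since $m(I_{w^1,\ldots,w^p})$
and $f_X(I_{w^1,\ldots,w^p})$ do not change within an interval:

\[
f'_X(a^1,\ldots,a^p)
= \frac{1}{n}\sum_{i=1}^{n}
  \frac{f_Y\bigl(m(I_{z_i^1,\ldots,z_i^p}) - m(I_{a^1,\ldots,a^p})\bigr)\,f_X(I_{a^1,\ldots,a^p})}
  {\sum_{t^1}\cdots\sum_{t^p}
  f_Y\bigl(m(I_{z_i^1,\ldots,z_i^p}) - m(I_{t^1,\ldots,t^p})\bigr)\,f_X(I_{t^1,\ldots,t^p})\,V_{t^1,\ldots,t^p}}
\]

where each sum over $t^j$ runs over the intervals of the j-th dimension.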

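The average value of $f'_X$ can now be computed over the interval $I_{a^1,\ldots,a^p}$ as

\[
\begin{aligned}
f'_X(I_{a^1,\ldots,a^p})
&= \int_{I_{a^1,\ldots,a^p}} f'_X(w^1,\ldots,w^p)\,dw^1\cdots dw^p \Big/ V_{a^1,\ldots,a^p} \\
&= \int_{I_{a^1,\ldots,a^p}} \frac{1}{n}\sum_{i=1}^{n}
   \frac{f_Y\bigl(m(I_{z_i^1,\ldots,z_i^p}) - m(I_{w^1,\ldots,w^p})\bigr)\,f_X(I_{w^1,\ldots,w^p})}
   {\sum_{t^1}\cdots\sum_{t^p}
   f_Y\bigl(m(I_{z_i^1,\ldots,z_i^p}) - m(I_{t^1,\ldots,t^p})\bigr)\,f_X(I_{t^1,\ldots,t^p})\,V_{t^1,\ldots,t^p}}
   \,dw^1\cdots dw^p \Big/ V_{a^1,\ldots,a^p}
\end{aligned}
\]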



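Since the above formula does not distinguish between points within the same
interval, we can accumulate those points using the counters $N(\cdot)$ mentioned in
Section 3:

\[
f'_X(I_{a^1,\ldots,a^p})
= \frac{1}{n}\sum_{s^1}\cdots\sum_{s^p} N(I_{s^1,\ldots,s^p}) \times
  \frac{f_Y\bigl(m(I_{s^1,\ldots,s^p}) - m(I_{a^1,\ldots,a^p})\bigr)\,f_X(I_{a^1,\ldots,a^p})}
  {\sum_{t^1}\cdots\sum_{t^p}
  f_Y\bigl(m(I_{s^1,\ldots,s^p}) - m(I_{t^1,\ldots,t^p})\bigr)\,f_X(I_{t^1,\ldots,t^p})\,V_{t^1,\ldots,t^p}}
\tag{5}
\]

where the sums over $s^j$ and $t^j$ run over the intervals of the j-th dimension.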

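Finally, let $Pr'(X \in I_{a^1,\ldots,a^p})$ be the posterior probability of X belonging to
interval $I_{a^1,\ldots,a^p}$, i.e.

\[ Pr'(X \in I_{a^1,\ldots,a^p}) = f'_X(I_{a^1,\ldots,a^p}) \times V_{a^1,\ldots,a^p} \]

Multiplying both sides of Equation (5) by $V_{a^1,\ldots,a^p}$ and using that
$Pr(X \in I_{a^1,\ldots,a^p}) = f_X(I_{a^1,\ldots,a^p}) \times V_{a^1,\ldots,a^p}$,
we have

\[
Pr'(X \in I_{a^1,\ldots,a^p})
= \frac{1}{n}\sum_{s^1}\cdots\sum_{s^p} N(I_{s^1,\ldots,s^p}) \times
  \frac{f_Y\bigl(m(I_{s^1,\ldots,s^p}) - m(I_{a^1,\ldots,a^p})\bigr)\,Pr(X \in I_{a^1,\ldots,a^p})}
  {\sum_{t^1}\cdots\sum_{t^p}
  f_Y\bigl(m(I_{s^1,\ldots,s^p}) - m(I_{t^1,\ldots,t^p})\bigr)\,Pr(X \in I_{t^1,\ldots,t^p})}
\tag{6}
\]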

Equation (6) is the basis for the recurrence used in Algorithm 1. 
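The recurrence in Equation (6) can be sketched in a few lines. This is not Algorithm 1 itself but a minimal illustration under assumptions that are not in the original text: the masked and original data share the same grid of cells, each cell is given by its midpoint, the noise is independent zero-mean Gaussian, and a fixed number of iterations is used instead of a stopping criterion. The function names are hypothetical.

```python
import math


def gaussian_noise_density(sigmas):
    """Density f_Y of independent zero-mean Gaussian noise (assumed noise model)."""
    def density(diff):
        value = 1.0
        for d, s in zip(diff, sigmas):
            value *= math.exp(-0.5 * (d / s) ** 2) / (s * math.sqrt(2 * math.pi))
        return value
    return density


def reconstruct(midpoints, counts, noise_density, iterations=20):
    """Iterate Equation (6) starting from the uniform distribution.

    midpoints: list of cell midpoints m(I), each a tuple of p coordinates.
    counts: counts[s] = N(I_s), the number of masked records falling in cell s.
    Returns the estimated probabilities Pr(X in I_a) for every cell a.
    """
    n = sum(counts)
    k = len(midpoints)
    prob = [1.0 / k] * k  # initial estimate: uniform over the cells
    # kernel[s][t] = f_Y(m(I_s) - m(I_t))
    kernel = [[noise_density([zs - xt for zs, xt in zip(ms, mt)])
               for mt in midpoints] for ms in midpoints]
    for _ in range(iterations):
        denom = [sum(kernel[s][t] * prob[t] for t in range(k)) for s in range(k)]
        prob = [sum(counts[s] * kernel[s][a] * prob[a] / denom[s]
                    for s in range(k) if counts[s] > 0) / n
                for a in range(k)]
    return prob
```

Each pass keeps the estimated probabilities summing to one, because the denominator for cell s is exactly the sum over all cells of the corresponding numerators. The cost per iteration grows with the square of the number of cells, which is consistent with the remark that computational requirements increase with p.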
Abstract. Microaggregation is a masking procedure used for protecting
confidential data prior to their public release. This technique, which relies
on clustering and aggregation techniques, has so far been used only for
numerical data. In this work we introduce a microaggregation procedure for
categorical variables. We describe the new masking method and we analyse
the results it obtains according to some indices found in the literature.
The method is compared with Top and Bottom Coding, Global Recoding,
Rank Swapping and PRAM.

Keywords: Privacy preserving data mining, Data protection, Masking
methods, Clustering, Microaggregation, Categorical data



1 Introduction 

Companies and Statistical Offices collect data from respondents (either
individuals or companies) to extract relevant information or to inform policy makers
and researchers. However, this goal has to be fulfilled while assuring
confidentiality and, thus, avoiding the disclosure of respondents' sensitive data.
That is, disclosure risk should be minimized. Statistical disclosure control (SDC),
or Inference Control, studies tools and methods (namely, masking methods)
that allow the dissemination of data while protecting confidentiality. Privacy
preserving data mining [1] is a related field with similar goals. While the former is
oriented to statistical databases, the latter is oriented to company proprietary
information.

It is important to note that a straightforward manipulation of the data is
not enough for avoiding disclosure because data has to maintain the so-called
analytical validity [25]. That is, in short, the analysis performed on the
protected data has to lead to results similar to the ones obtained using the
original data. In other words, information loss should be small. See [7] and [24]
for a state-of-the-art description of the field.

For the purpose of data confidentiality, a plethora of masking methods have
been designed. A comprehensive description of the methods and their properties
is given in [4] and [24]. See also [5] for a comparative analysis of the methods with
respect to some indices for measuring information loss and disclosure risk. In [9],
an up-to-date review of the methods currently in use by National Statistical
Offices is given.
Masking methods can be classified according to different dimensions. In
relation to this work, it is relevant to classify methods according to the type of
variables. Then, methods for categorical (either nominal or ordinal) variables and
methods for numerical (continuous) variables can be distinguished. Among the
methods for numerical variables, we can distinguish rank swapping,
microaggregation and noise addition. These methods are currently in use [9] and have
good performance [5] in relation to information loss and disclosure risk indices.
For categorical data, existing methods include Rank Swapping, Top and Bottom
coding, recoding and PRAM (Post-Randomization Method).

A detailed analysis of the methods (see e.g. [4]) shows that some of the
methods for numerical variables are not applicable to categorical data (and
vice versa). This is due to the intrinsic nature of the variables and the difficulties of
translating some numerical functions (e.g. addition, averaging) into a categorical
domain. Among those methods we find microaggregation.

From the operational point of view, microaggregation consists of obtaining
a set of clusters (gathering similar respondents) and, then, replacing the original
data by the averages of all the respondents in the corresponding cluster. In this
way, the data for each respondent is protected. To avoid disclosure, clusters
have to contain a minimum number of respondents (otherwise the average does
not avoid the disclosure because an individual contributing to the cluster, or an
external individual, can guess the value of another respondent). The difficulties in
extending this approach to categorical data lie in the clustering and aggregation
steps and their suitability to deal with categorical variables.

Although, in general, the interest of translating masking methods from one
scale to another is not clear, the case for microaggregation is different. Numerical
microaggregation performs quite well with respect to the different existing indices
for information loss and disclosure risk. Moreover, it is shown in [5] that it is
the second best rated method for numerical data, just behind rank swapping.
Therefore, it seems appropriate to consider a categorical microaggregation and
whether this method can also lead to similarly good results.

In this work we introduce a categorical microaggregation method and we
show that it outperforms other masking methods for the same type of data.
The method is based on clustering techniques (see e.g. [16, 17]) and on some
aggregation operators, both for categorical data.

The structure of this work is as follows. In Section 2, we describe the
basic elements we need later on for defining our categorical approach. Then, in
Section 3, we describe the categorical microaggregation procedure. Section 4
gives a detailed analysis of the experiments performed to evaluate our
approach. The work finishes in Section 5 with some conclusions and future work.



2 Preliminaries 

This section is divided into two parts. We start with a short description of
microaggregation. Then, we review some aggregation procedures that can be used
for categorical data.
2.1 Microaggregation

As briefly described in the introduction, microaggregation can be operationally
defined in terms of the following two steps:

Partition: The set of original records is partitioned into several clusters in
such a way that records in the same cluster are similar to each other and so
that the number of records in each cluster is at least k.

Aggregation: The average value (a kind of prototype) is computed for each
cluster and used to replace the original values in the records. That is, each
record is replaced by its prototype.

According to this description, an actual implementation of microaggregation
requires a clustering method and an aggregation procedure. While for numerical
data the main difficulty lies in the clustering method (most clustering methods do
not apply because they do not satisfy the constraint on the minimal number
of records in each cluster), for categorical data difficulties appear in both
processes.
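For numerical data, the two steps can be illustrated with a very simple univariate sketch. It is not one of the methods cited in this paper: it assumes a fixed-size heuristic in which sorted values are grouped k at a time, the last incomplete group is merged into the previous one, and the arithmetic mean is used as prototype. The function name is hypothetical.

```python
def microaggregate(values, k):
    """Fixed-size univariate microaggregation (illustrative heuristic only)."""
    if k < 1 or len(values) < k:
        raise ValueError("need at least k values")
    # Partition: sort record indices by value and cut them into groups of k.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())  # keep every cluster at size >= k
    # Aggregation: replace each value by the mean of its cluster.
    masked = list(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked
```

For example, with k = 3 the values 1, 2, 3 form one cluster and 10, 11, 12, 13 form another, so every released value is shared by at least three records.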

In fact, while, to our knowledge, there is no microaggregation method for
categorical data, there exist several methods for numerical data. The latter
methods differ in the way clusters are built (modification of standard
techniques, novel approaches using genetic algorithms with an appropriate
fitness function, ...), in the way a large set of variables is considered
(repeatedly applying univariate microaggregation, i.e. microaggregation for
a single variable; applying multivariate microaggregation, i.e. all the variables at
once; ...) or in the aggregation procedure. In relation to the aggregation
procedure, while the most common method is the arithmetic mean, other procedures,
such as the median operator [18], have also been used.

2.2 Aggregation Procedures for Categorical Data 

At present, there exist several aggregation functions for categorical data. See
e.g., [26] for a recent survey on aggregation operators. Here we can distinguish
between operators for nominal scales (where only equality can be used to
compare elements) and ordinal scales (where there is an ordering among the elements). In
the case of nominal scales, the main operator is the plurality rule (mode or the
voting procedure).

In the case of ordinal scales, operators can be classified, following [22], in
three main classes. We review these classes below considering the ordinal scale
$L = \{l_0, \ldots, l_r\}$ with a total order $\le_L$ (defined as follows: $l_0 <_L l_1 <_L \cdots <_L l_r$).

1. Explicit quantitative or fuzzy scales: A mapping from L to a numerical (or
fuzzy) scale (say, N) is assumed. Then, aggregation functions are defined in
this underlying N scale. In some cases, this numerical scale is not given but
inferred from additional knowledge about the ordinal scale (e.g. from a
one-to-many negation function [20]). The operator in [22] follows this approach.

2. Implicit numerical scale: An implicit numerical scale underlying the ordinal
scale is assumed. Operations on L are defined as operating on the underlying
scale. The usual case is that each category $l_i$ is treated as the corresponding
integer i. This is the case of Linguistic OWA [11] and Linguistic WOWA [21].

3. Operating directly on categorical scales: Operators stick to a purely ordinal
scale and are based only on operators on this scale. This is the case of the
median operator or the Sugeno integral [19]. These operators solely rely on
$\le_L$ (or on minimum and maximum). Other operators in this class (e.g., the
ordinal weighted mean defined in [10]) are based on t-norms and t-conorms
(two operators that can be defined axiomatically over ordinal scales).

Aggregation operators for categorical scales were reviewed and analyzed
in [6]. The review focused on their application for prototype construction (a
case similar to the one considered here). Note that the aggregation step in
microaggregation can be understood as building a centroid (a representative) for
each cluster. In short, results show that the most relevant aggregation method
for ordinal scales is the median (simpler to use and with a straightforward
meaning) but that this operator does not allow for compensation. Recall that in this
setting compensation implies that the aggregation of some values $l_{k_i} \in L$ can be
a value in L different from the $l_{k_i}$ but in the interval $[\min l_{k_i}, \max l_{k_i}]$. Also, the
standard definition does not include weights. Then, [6] introduced the CWM to
consider weights and to allow for compensation.

In this operator, a set of data sources X is assumed to supply values $a_i$
(formally speaking, $a_i = f(x_i)$), and $p(x_i)$ are the importances of the sources $x_i \in X$.
Additionally, a function Q is used to distort the weights. The role of Q is to
distort the weights so that a greater importance is assigned to smaller, larger or
central values.

Definition 1. Let $p : X \to D \subset \mathbb{R}^+$ be a weighting vector, and let Q be a
non-decreasing fuzzy quantifier (a non-decreasing function $Q : [0,1] \to [0,1]$ with
Q(0) = 0 and Q(1) = 1). Then the CWM operator of dimension N
($CWM_p : L^N \to L$) is defined as:

\[ CWM_p(a_1, \ldots, a_N) = a \quad \text{if and only if} \quad acc''''(a) \ge 0.5 > acc''''(b) \]

where b is the element previous to a in L ($b = \max\{x \mid x \in L, x < a\}$) and
where $acc''''(a) = \sum_{b \le a} acc'''(b)$, $acc'''$ is the WOW-weighting vector of $(L, acc'')$
and Q, $acc''(a) = acc'(a) / \sum_{b \in L} acc'(b)$ with $acc'$ defined as:

\[ acc'(a) = \min\Bigl(\max_{b \le a} acc(b),\ \max_{b \ge a} acc(b)\Bigr) \tag{1} \]

and where $acc(a) = \sum_{f(x_i) = a} p(x_i)$.

Roughly speaking, acc accumulates the weight of each element in L, acc'
makes this function convex (to allow for compensation) and acc'' normalizes
so that it adds to one, acc''' is a manipulation of this function (through Q to
distort the importance of certain elements) and, finally, using acc'''' the element
that occupies the central position is selected.

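The WOW-weighting vector used in the above definition was defined as
follows:

Definition 2. Let $(a_i, p_i)_{i=1,N}$ be a pair defined by a value and the importance
of $a_i$ expressed in a given domain $D \subset \mathbb{R}^+$, and let Q be a fuzzy non-decreasing
quantifier. Then, the WOW-weighting vector $\omega = (\omega_1, \ldots, \omega_N)$ for (a, p) and Q
is defined as follows:

\[ \omega_i = Q\!\left(\frac{\sum_{j \le i} p_{\sigma(j)}}{\sum_{j \in L} p_{\sigma(j)}}\right)
            - Q\!\left(\frac{\sum_{j < i} p_{\sigma(j)}}{\sum_{j \in L} p_{\sigma(j)}}\right) \]

where $\sigma$ is a permutation such that $a_{\sigma(i)} \ge a_{\sigma(i+1)}$.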
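The chain acc, acc', acc'', acc''', acc'''' can be followed step by step in a short sketch. Several choices below are assumptions made for illustration and are not fixed by the text: categories are encoded as integer positions 0, ..., r, the quantifier is taken to be Q(x) = x^alpha, the WOW weights are accumulated in ascending category order, and the first category whose accumulated weight reaches the threshold is returned. The function name `cwm` and its parameters are hypothetical.

```python
def cwm(values, weights, n_categories, alpha=1.0, convex=True, threshold=0.5):
    """Sketch of a convex weighted median on an ordinal scale 0..n_categories-1."""
    # acc: accumulated weight of each category
    acc = [0.0] * n_categories
    for v, w in zip(values, weights):
        acc[v] += w
    # acc': convexified weights, as in Equation (1)
    if convex:
        acc1 = [min(max(acc[:a + 1]), max(acc[a:])) for a in range(n_categories)]
    else:
        acc1 = acc
    # acc'': normalized so that it adds to one
    total = sum(acc1)
    acc2 = [x / total for x in acc1]
    # acc''': weights distorted through Q(x) = x ** alpha (assumed quantifier)
    acc3, cum = [], 0.0
    for x in acc2:
        acc3.append((cum + x) ** alpha - cum ** alpha)
        cum += x
    # acc'''': accumulate and pick the category in the central position
    cum = 0.0
    for a, w in enumerate(acc3):
        cum += w
        if cum >= threshold:
            return a
    return n_categories - 1
```

With values 0, 0, 4, 4 of equal weight on a five-category scale, the convexified version returns the middle category 2, a value that none of the sources supplied; this is the compensation effect described above. Without convexification the same input returns category 0 under the tie-breaking assumed here.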

3 Proposed Method 

The proposed method for categorical microaggregation is based on the methods 
for clustering and aggregation defined in the previous section. The proposed 
algorithm is as follows: 

procedure microaggregation (M: data matrix; NVar: int) is
    I := select variables to be microaggregated (M);
    for i := 1 to |I| step NVar do
        WS := projection of M on variables (i ... min(|I|, i+NVar-1));
        WS2 := only different records from WS;
        FR := frequency of records (WS2, WS);
        NClust := appropriate number of clusters (WS2, FR);
        hard k-means of (WS2, FR);
        aggregation and replacement (WS2, FR, M);
    end for;
end procedure;

That is, first the variables to be microaggregated are selected from the data
matrix M. Then, groups of NVar variables are built from M, each defining a working
space (WS). Then, the WS is reduced (WS2) so that only different records are
kept. For each record in WS2, its frequency in WS is computed and stored in
FR. This frequency is used by the program to estimate an appropriate value for
NClust (the number of clusters) to be used in the clustering process. Then, the
clustering algorithm is applied. Finally, the original values are replaced by the
new ones (the centroids of the clusters).
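The first steps of the procedure (grouping the selected variables NVar at a time, projecting, and collapsing duplicate records with their frequencies) might be sketched as follows. The function names and the representation of M as a list of row tuples are assumptions made for illustration only.

```python
from collections import Counter


def variable_groups(selected, nvar):
    """Split the selected variable indices into consecutive groups of size nvar.

    The last group may be smaller, which corresponds to bounding the upper
    index of each group by the number of selected variables.
    """
    return [selected[i:i + nvar] for i in range(0, len(selected), nvar)]


def distinct_with_frequency(matrix, columns):
    """Project the rows of matrix on columns (WS), keep distinct records (WS2)
    and count how often each distinct record occurs in WS (FR)."""
    ws = [tuple(row[c] for c in columns) for row in matrix]
    fr = Counter(ws)
    ws2 = list(fr)  # distinct records, in order of first appearance
    return ws2, fr
```

Clustering the distinct records weighted by their frequencies, rather than all records, keeps the working space small when many respondents share the same combination of categories.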

Now, we describe in more detail some of the elements that appear in the 
algorithm above: 

Clustering: Our clustering algorithm is based on the k-modes algorithm. This
latter algorithm, designed for categorical data (see [12]), is inspired by the
k-means algorithm (for numerical data). In short, the method seeks a good
set of clusters through an iterative process consisting of the following steps:
(i) for each cluster, representatives are computed; (ii) records are assigned
to their nearest cluster.

The following four aspects have been considered in our implementation:

a) To bootstrap the process, an initial partition is needed. We build it at
random.

b) To determine which is the nearest cluster of a record, a distance,
defined as the sum of the distances between individual values, is
used. Here, nominal and ordinal scales are differentiated. For variables
on nominal scales, distance is defined as 1 when values are different and
0 when equal. In ordinal scales, distance is defined according to the
position of the categories in the domain.

c) To compute the representatives, an aggregation method is used variable
by variable. We use the plurality rule (mode or voting procedure) in
nominal scales. In ordinal scales, three alternatives have been considered:
mode (as for nominal scales), the CWM (as defined above) and a CWM
where a is the selected element if and only if $acc''''(a) \ge \beta > acc''''(b)$ (for
a randomly selected $\beta$).

d) To assure that all final micro-clusters have a desired cardinality, some
elements are relocated.

Aggregation: For aggregation, we apply the same process used for computing
cluster representatives in the clustering algorithm.
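The distance and the plurality rule described above can be sketched as follows. The text does not give the exact ordinal distance, so the sketch assumes the absolute difference between category positions; ordinal values are therefore encoded as integers. All names are hypothetical.

```python
from collections import Counter


def record_distance(r1, r2, ordinal):
    """Sum of per-variable distances between two records.

    ordinal[j] is True when variable j is ordinal (values are integer
    positions in the domain), False when it is nominal.
    """
    total = 0
    for a, b, is_ordinal in zip(r1, r2, ordinal):
        if is_ordinal:
            total += abs(a - b)          # assumed: difference of positions
        else:
            total += 0 if a == b else 1  # nominal: 0 if equal, 1 otherwise
    return total


def nearest_cluster(record, representatives, ordinal):
    """Index of the representative closest to the record."""
    return min(range(len(representatives)),
               key=lambda c: record_distance(record, representatives[c], ordinal))


def plurality(values, frequencies):
    """Plurality rule (mode): the value with the largest accumulated frequency."""
    votes = Counter()
    for v, f in zip(values, frequencies):
        votes[v] += f
    return votes.most_common(1)[0][0]
```

An iteration of the clustering loop would alternate between calling `plurality` (or a median-type operator for ordinal variables) per variable to rebuild the representatives and calling `nearest_cluster` to reassign the records.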



4 Results 

In this section we describe the results obtained for our masking method. We
start by describing the methodology used to evaluate our method and then the
experiments and the conclusions drawn from them.



4.1 Evaluation Method 

To evaluate our approach we have applied the methodology previously used
in [5] and in [27], which consists of developing a score combining two measures, one for
disclosure risk and the other for information loss. The score can be computed for
any pair (original-file, masked-file). Then, a data file was masked using different
masking methods (and considering different parameterizations for each masking
method) and the scores for each pair (original-file, masked-file) were obtained
and compared.

In this way, we got a score for each pair (masking-method,
parameterization). We now consider the computation of the score and the masking
methods against which the categorical microaggregation has been evaluated. We
also describe the file used and how masked files have been constructed from this
file.
The Score. The score used is the mean of an information loss measure and of
a disclosure risk measure. The rationale of this definition is that a good masking
method is one that achieves a good trade-off between these two aspects.
This definition is motivated by the fact that information loss and disclosure risk
are contradictory properties (no risk usually implies no information, and total
information implies total risk).

Disclosure risk was measured (following [5] and [27]) as the number of records
re-identified using record linkage programs (we used the mean of the number of
re-identifications obtained using two record linkage methods: probabilistic and
distance-based). Information loss was also computed as a mean, in this case a
mean of several information loss measures. In particular, we considered a direct
comparison of categorical values, a comparison of the contingency tables and
some entropy-based measures. These measures are described in detail in [5].
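The structure of the score can be written down in a few lines. The sketch assumes that every measure has already been expressed on a common scale (for instance, as percentages); that normalization is an assumption of this example, and the function name is hypothetical.

```python
from statistics import fmean


def score(information_loss_measures, reid_probabilistic, reid_distance):
    """Mean of information loss and disclosure risk (lower is better).

    information_loss_measures: the individual information loss measures.
    reid_probabilistic, reid_distance: re-identification results of the two
    record linkage methods, on the same scale as the information loss.
    """
    information_loss = fmean(information_loss_measures)
    disclosure_risk = fmean([reid_probabilistic, reid_distance])
    return fmean([information_loss, disclosure_risk])
```

For instance, information loss measures of 20 and 40 together with re-identification results of 10 and 30 give an information loss of 30, a disclosure risk of 20 and a score of 25.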

Masking Methods Considered. To evaluate the performance of our
categorical microaggregation, we have considered 5 alternative masking methods.
They are Top and Bottom coding, Rank swapping, Global recoding and the
Post-Randomization method (PRAM). For each masking method, we have
considered 9 different parameterizations. We briefly describe these methods and the
parameterizations considered (see [4] for details).

Top-coding (abbreviated T): This method consists of recoding the last p
values of a variable into a new category. We have considered p = 1, 2, ..., 9.

Bottom-coding (abbr. B): This method is analogous to the previous one but
recoding the first p values of a variable into the new category. We have
considered p = 1, 2, ..., 9.

Global recoding (abbr. G): This method recodes some of the categories
into new ones. In our experiment, we have recoded the p categories with
lowest frequency into a single one. As before, we have considered p = 1, 2, ..., 9.

Post-Randomization method or PRAM (abbr. P): Some values are
replaced by other values according to a Markov matrix. In our experiments,
we have considered the Markov matrix described in [14]. That is, let
$T_V = (T_V(1), \ldots, T_V(K))'$ be the vector of frequencies of the K categories of
variable V in the original file (without loss of generality, assume
$T_V(K) = \min_k T_V(k)$), and let $\theta$ be such that $0 < \theta < 1$; then the PRAM matrix for
the variable V is defined as:

\[
p_{kl} =
\begin{cases}
1 - \theta\, T_V(K) / T_V(k) & \text{if } l = k \\
\theta\, T_V(K) / \bigl((K-1)\, T_V(k)\bigr) & \text{if } l \ne k
\end{cases}
\]

Let the parameter p be $p := 10\theta$. For each variable we have built nine
matrices generated with p taking integer values between 1 and 9.

Rank Swapping (abbr. R): From an operational point of view, this method
consists of first ordering the values in ascending order and then replacing
each ranked value with another ranked value randomly chosen within a
restricted range. For example, the rank of two swapped values cannot differ
by more than p percent of the total number of records. We consider values
of p from 1 to 9.
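As an illustration of the PRAM parameterization, the following sketch builds the Markov matrix from the category frequencies and theta = p/10, and applies it to a list of values. It assumes at least two categories, all with non-zero frequency; the function names are hypothetical, and the minimum frequency is looked up directly instead of assuming that the last category is the rarest.

```python
import random


def pram_matrix(frequencies, theta):
    """Markov matrix with entry [k][l] = probability of publishing l given k."""
    K = len(frequencies)
    t_min = min(frequencies)  # plays the role of T_V(K)
    matrix = []
    for k in range(K):
        off_diagonal = theta * t_min / ((K - 1) * frequencies[k])
        row = [off_diagonal] * K
        row[k] = 1 - theta * t_min / frequencies[k]
        matrix.append(row)
    return matrix


def apply_pram(values, categories, matrix, rng):
    """Replace each value by a category drawn from its row of the matrix."""
    index = {c: i for i, c in enumerate(categories)}
    return [rng.choices(categories, weights=matrix[index[v]])[0] for v in values]
```

Each row of the matrix adds to one, and multiplying the frequency vector by the matrix returns the same frequencies, so the expected category frequencies of the masked file equal those of the original file.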
The Original File and the Masked Files. We have used a data set extracted
from the Data Extraction System (DES) of the U. S. Census Bureau [3]. We
selected data from the American Housing Survey 1993. Variables and records
were selected as follows:

Variables selected: BUILT (year structure was built), DEGREE (long-term
average degree days), GRADE1 (highest school grade), METRO
(metropolitan areas), SCH (schools adequate), SHP (shopping facilities adequate),
TRAN1 (principal means of transportation to work), WFUEL (fuel used
to heat water), WHYMOVE (primary reason for moving), WHYTOH (main
reason for choice of house), WHYTON (main reason for choosing this
neighborhood). BUILT, DEGREE and GRADE1 were considered ordinal variables
and the others nominal.

Records selected: We took the first 1000 records from the corresponding data
file. The number of records is small so that repeated experimentation was
possible in reasonable time.

For each file, for each masking method and for each parameterization, 5
different experiments have been carried out, each considering a different subset
of variables in the process. That is, we have considered the five subsets of
variables described in Table 1. The table also includes the names we have given
to the sets. Note that the set z includes only nominal variables, the set o
includes only ordinal variables and the others consider both nominal and ordinal
variables.



Table 1. Subsets of variables considered in the experiments

Variable   Type      Membership marks for the sets p, z, m, o, g
BUILT      ordinal   x x
DEGREE     ordinal   x x x
GRADE1     ordinal   x x
METRO      nominal   x x
SCH        nominal   x x
SHP        nominal   x x
TRAN1      nominal   x x
WFUEL      nominal   x x
WHYMOVE    nominal   x
WHYTOH     nominal   x x
WHYTON     nominal   x x

4.2 Experiments 

Several parameterizations have been considered for microaggregation. Each
parameterization consists of several parameters. Some of the parameters refer to
how variables are selected, others control the partition step and some others
correspond to the aggregation step. We describe these parameters below, divided
into these three classes.

Parameters Concerning Variable Selection: a single parameter has been
considered for this aspect:

1) NVar: corresponds to the number of variables to be aggregated together.
This is for multivariate microaggregation (when several variables have
to be microaggregated). In this case, groups of NVar variables are
considered.

Parameters Concerning the Partition Step: two parameters are used to
control our variation of the k-modes algorithm.

1) K: is the minimum number of records included in a partition.

2) Nit: refers to the maximum number of iterations allowed in the iterative
process.

Parameters Concerning the Aggregation Step: four different parameters
are used to select the aggregation procedure and to fix it.

1) Mode?: in the case of categorical variables on ordinal scales, we can
select between the mode and the median aggregation method. When this
parameter is set to true, the mode is applied. In the case of nominal
scales, only the mode operator is allowed.

2) Convex?: this parameter makes it possible to make the frequency function
convex. When set to true, Equation 1 is applied (instead, when set to
false, acc'(a) = acc(a)). Recall that making frequencies convex allows
compensation among small and larger values because, when using the
median, the aggregation of a large and a small value can lead to
something in between. This option can only be applied to ordinal variables.

3) Alpha: this parameter is used to distort the probabilities using the fuzzy
quantifier $Q(x) = x^{\alpha}$. Recall that the use of a fuzzy quantifier allows
the importance of large, central or small values to be increased. Again, this
option can only be applied to ordinal variables.

4) Random?: instead of applying the median, a category is selected at
random among the categories in the cluster when Random? is set to true.
The probability of selecting a particular value is proportional to its
frequency. As before, this option can only be applied to ordinal variables.
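The Random? option can be illustrated with a short sketch that draws a category with probability proportional to its frequency in the cluster. The representation of the cluster as a dictionary of frequencies and the function name are assumptions made for this example.

```python
import random


def random_category(frequencies, rng):
    """Draw one category with probability proportional to its frequency.

    frequencies: dict mapping each category in the cluster to its frequency.
    rng: a random.Random instance, passed in so that runs are reproducible.
    """
    categories = list(frequencies)
    weights = [frequencies[c] for c in categories]
    return rng.choices(categories, weights=weights)[0]
```

A cluster in which one category accounts for 90% of the records would thus be represented by that category in roughly nine out of ten draws.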

For each of the parameters above (except for the number of iterations Nit,
which is fixed to 5) several values have been considered. In particular,
we have considered all aggregation methods, and values between 1 and 9 for
the number of variables (NVar) and for the parameter K. For the parameter
$\alpha$ the values 0.2, 0.4, 0.6, ..., 2.0 have been considered.

The parameterizations considered for the microaggregation together with
the experiments considered for all the other masking methods resulted in 24525
different experiments. The computation of all these experiments lasted 4.5 days
(on a PC at 2 GHz, running Red Hat 7.0).

Table 2 gives the 5 best parameterizations for each of the 5 sets of variables. It can
be observed that the best results are obtained for Mode?=false (24 times versus 1),
which corresponds to the use of the CWOW-median operator, Random?=true
(23 times versus 2), Convex?=false (15 times versus 10), a large number of
variables (9 is the most frequently selected NVar), and K=9. The most frequent
value for the $\alpha$ parameter is 0.6 (0.8 being the second one).
Table 2. Parameterizations that performed the best for microaggregation

Mode?  Random?  Convex?  NVar  K   α    Set  Rank
F      T        T        04    09  2.0  p    1220
T      T        T        09    09  1.6  p    1352
F      T        T        06    07  0.6  p    1455
F      T        T        06    07  0.4  p    1567
F      T        F        07    07  1.4  p    1580
F      T        F        09    07  0.6  z    2696
F      T        F        04    05  0.2  z    2824
F      T        F        04    05  0.4  z    2825
F      T        F        04    05  0.6  z    2826
F      T        F        04    05  1.0  z    2827
F      T        T        09    09  1.2  m    4605
F      T        F        08    09  0.6  m    5483
F      T        F        09    09  1.2  m    5701
F      T        T        09    09  1.0  m    6029
F      T        F        09    09  1.6  m    6087
F      T        F        08    08  0.6  o    1
F      T        F        09    08  1.0  o    2
F      F        F        04    09  0.6  o    3
F      F        F        09    09  0.6  o    4
F      T        T        04    06  1.6  o    5
F      T        T        09    06  1.2  g    14506
F      T        F        09    07  1.0  g    14563
F      T        T        09    04  1.8  g    14585
F      T        T        09    04  1.0  g    14596
F      T        F        08    05  1.6  g    14601


This table also gives (column Rank) the position in a global ranking
considering (method, parameterization, original file). Table 3 gives the two best
parameterizations for all the other masking methods tested (B, T, G, R, P),
using each of the selected sets of variables. In this table, P(Set) refers to the best
parameterization obtained for the masking method with the set of variables Set,
and Rank(Set) refers to the position in the ranking of such parameterization
for the same set. It can be observed that the best parameterizations in Table 2
perform better than the parameterizations of the other methods except for the
set g.

According to all this, a categorical microaggregation with parameters Mode? 
equal to false and Random? equal to true yields good results. A good set of 
parameters is when the number of variables is large {e.g., NVar=9) and the 
constants K and a are such as: K = 9 and a = 0.6. 

In the case of the set g, the best performance corresponds to the PRAM 
masking method. Note that the set g precisely corresponds to the set with a 
major number of variables. 

Using Table 3, we can see that Rank Swapping can be considered as the sec- 
ond best masking method. We observe that when the set of variables considered 
corresponds to p, m or o, rank swapping has the second best performance. These 
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three sets of variables correspond to the case in which all or most of the variables 
are ordinal. The two other sets analyzed 2 and g, with bad results, correspond 
to either all variables nominal or one ordinal over 7 nominal ones. 

Table 3. For each masking method considered, except for microaggregation, the best 
two parameterizations in terms of the score 

Method  P(p)  Rank(p)  P(z)  Rank(z)  P(m)  Rank(m)  P(o)  Rank(o)  P(g)  Rank(g) 
B       9     21548    1     16534    4     23682    7     12996    1     23719 
B       8     22158    7     18828    3     23761    9     13183    2     24503 
T       9     17318    7     7075     3     21936    6     13504    1     11660 
T       8     19744    6     7319     2     22714    5     14367    2     12351 
G       9     21703    6     7277     4     22950    7     11872    2     11233 
G       8     22452    5     7656     3     23400    8     12199    1     11863 
R       6     9567     1     9664     3     11423    2     9797     2     19998 
R       9     10779    7     10917    7     12000    3     11139    1     20109 
P       4     23590    9     9161     9     20945    9     21789    9     10473 
P       6     23604    5     9189     8     21605    8     22187    6     11079 



5 Conclusions and Future Work 

The results presented here expand the ones in [5]. In this work, two additional 
masking methods (namely, Rank Swapping and Categorical microaggregation) 
have been added. In that paper, it was concluded that the performance of PRAM 
(with the current parameterization) was not good. In this work we have shown 
that for a particular set of variables PRAM yields the best results. These results 
have been obtained for the set with the largest number of variables. Nevertheless, 
further work is needed to confirm the influence of a large number of variables 
on the good performance of PRAM. 

Categorical microaggregation has been shown to perform well. 
This method is based on the k-modes clustering algorithm and uses either the mode 
or the median for the aggregation step. For each of the sets of variables except 
one, there was a parameterization that yielded the best performance. 

Additionally, the analysis described in this paper shows that rank swapping is 
the second best masking method for ordinal data. This method was not included 
in the analysis in [5]. 

In this work we have studied masking methods using general information loss 
and disclosure risk measures, that is, without considering particular data uses. 
The analysis of microaggregation from the viewpoint of particular data uses 
remains as future work. 

As future work we consider the application of fuzzy clustering algorithms in 
the clustering partition step. Recent results on fuzzy clustering are reported in 
e.g. [2,8,13,15]. 
Abstract. Microaggregation is a well-known technique for data protection. 
It is usually operationally defined in a two-step process: (i) a large 
number of small clusters are built from data and (ii) data are replaced by 
cluster aggregates. In this work we study the use of fuzzy clustering in the 
first step. In particular, we consider standard fuzzy c-means and entropy-based 
fuzzy c-means. For both methods, our study includes variable-size 
and non-variable-size variations. The resulting masking methods are 
compared using standard scoring methods. 

Keywords: Privacy preserving data mining, Statistical Disclosure Con- 
trol, Inference Control, Microdata Protection, Microaggregation, Fuzzy 
clustering. 



1 Introduction 

In recent years, there has been an increasing demand for microdata among 
researchers and decision makers, due to the increasing computer power and the 
development of new modeling techniques. Nevertheless, private corporations and 
governmental institutions are not allowed to disseminate data if this allows the 
disclosure of sensitive information about individuals or companies. 

To make data dissemination possible, masking methods have been developed. 
They distort the original data so that the data is still useful for analysis (data 
validity) but sensitive information cannot be linked to the original respondents. 
At present there exists a large set of masking methods (see e.g. [5,22] for a 
review). Among them, we underline the so-called microaggregation method. This 
method consists of building a large number of clusters with a few elements each 
(at least k elements, where k is a predefined constant) and then replacing the 
original data by the aggregates (or centroids) of the corresponding clusters. In 
other words, instead of publishing the original elements, we publish the centroids 
of the clusters to which these elements belong. Table 1 illustrates this method. 
Four columns in this table correspond to the original file (columns denoted by 
variables v1, v2, v3, v4) and four columns correspond to the masked file (columns 
v'1, v'2, v'3, v'4). In this example, microaggregation has been computed using 
Argus software [10] with k = 3 and considering groups of two variables (c = 2) 
at a time. 

The rationale of the approach is that, by considering clusters with only a few 
(at least k) similar elements, the original data is not disclosed (only the aggregates 
are published, not the original data), anonymity is assured (within the set of at 
least k elements) and, as the aggregates are similar to the original data, some 
analyses lead to similar results. 

Table 1. Original file (variables v1, v2, v3, v4) and microaggregated file (variables v'1, v'2, v'3, v'4). 

v1  v2  v3  v4  v'1      v'2      v'3      v'4 
1   1   1   1   1.66667  2        1.33333  1.66667 
2   2   1   2   1.66667  2        1.33333  1.66667 
2   3   1   6   1.66667  2        2.33333  5.66667 
2   9   1   10  3        7.33333  1.66667  9.66667 
3   6   2   2   3        7.33333  1.33333  1.66667 
4   1   2   9   4.33333  5        1.66667  9.66667 
4   6   2   10  4.33333  5        1.66667  9.66667 
4   7   3   2   3        7.33333  2.33333  5.66667 
5   8   3   9   4.33333  5        2.33333  5.66667 
6   8   4   7   7.66667  8.66667  6        5 
8   1   7   2   8.66667  2.66667  6        5 
8   9   7   6   7.66667  8.66667  6        5 
9   3   8   1   8.66667  2.66667  8.66667  1.33333 
9   4   8   2   8.66667  2.66667  8.66667  1.33333 
9   9   10  1   7.66667  8.66667  8.66667  1.33333 
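
The replacement step behind Table 1 can be sketched in a few lines. This is a minimal illustration, not the Argus implementation: the function name `replace_by_centroid` is hypothetical, and the clusters are not computed but taken as given, inferred here by grouping the rows of Table 1 that share the same masked (v'1, v'2) values.

```python
def replace_by_centroid(records, clusters):
    """Replace every record by the mean (centroid) of its cluster.

    records  -- list of tuples of numbers
    clusters -- list of lists of record indices, assumed to be a partition
    """
    masked = [None] * len(records)
    for members in clusters:
        dim = len(records[members[0]])
        centroid = tuple(
            sum(records[i][d] for i in members) / len(members)
            for d in range(dim)
        )
        for i in members:
            masked[i] = centroid
    return masked


# (v1, v2) values of the 15 records of Table 1, in row order
records = [(1, 1), (2, 2), (2, 3), (2, 9), (3, 6), (4, 1), (4, 6), (4, 7),
           (5, 8), (6, 8), (8, 1), (8, 9), (9, 3), (9, 4), (9, 9)]
# Clusters of k = 3 records (assumption: read off the masked columns)
clusters = [[0, 1, 2], [3, 4, 7], [5, 6, 8], [9, 11, 14], [10, 12, 13]]

masked = replace_by_centroid(records, clusters)
for original, protected in zip(records, masked):
    print(original, "->", tuple(round(value, 5) for value in protected))
```

Every cluster holds exactly k = 3 records, so each published pair of values is shared by three respondents.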



To evaluate masking methods (and possible parameterizations), a score was 
defined in [6]. There, several masking methods, and several parameterizations 
for each method, were ranked according to a mean (a tradeoff) between 
information loss and disclosure risk measures. These measures evaluate 
data validity and risk: 

Information loss: evaluates to what extent distortion makes data useless. 
Disclosure risk: measures whether masked data can be linked to the original data. 

In that study, microaggregation was ranked the second best method (a score 
of 28.45 out of 100 was computed for the best microaggregation parameterization), 
just after Rank-swapping (with a best score of 18.56; lower scores are better). 
The study concluded that microaggregation is a suitable method for masking data. 
Nevertheless, there is still some room for improvement. 

In a recent work [7] (see also [8]), we considered the use of fuzzy clustering in 
microaggregation. This approach was introduced with two objectives in mind. 
The first goal was to build more compact clusters. As fuzzy clustering (see e.g. 
[2,16], or [3,9,13,14] for recent results) permits an element to belong to two 
clusters at the same time, the constraint of having at least k elements in each 
cluster can be softened, allowing for k elements with partial membership. The 
second goal was to reduce the disclosure risk associated with the method. Note 
that the masked file for any microaggregation method gives clues about the 
parameters used in the microaggregation process. For example, the masked file 
in Table 1 shows clearly that the parameters k and c were, respectively, k = 3 and 
c = 2. When these parameters are known, ad hoc re-identification methods (to 
link the masked data and the original data) can be defined and, thus, disclosure 
risk increases. Fuzzy clustering permits an element to belong at the same 
time to two (or more) different clusters. Therefore, when the original values of a 
record are replaced, any of several clusters can be selected to supply the 
replacement. Accordingly, the final masked file does not give clues on the clusters 
obtained, and specific re-identification methods cannot be applied. Therefore, 
the number of re-identified records decreases. 

The structure of this work is as follows. In Section 2, some tools that are 
needed in the rest of the paper are reviewed. In particular, we review 
some fuzzy clustering methods and the standard procedure to benchmark masking 
methods. Section 3 focuses on disclosure risk for microaggregation and, 
more particularly, on distance-based record linkage. We show that ad hoc methods 
for linking records can increase disclosure risk significantly (up to 15%). 
Then, Section 4 describes our microaggregation approach and the experiments 
performed. The paper finishes in Section 5 with some conclusions. 

2 Preliminaries 

In this section we review some tools that are needed later on in this work. We 
start with a description of some fuzzy clustering methods. Then, we turn to 
the benchmarking procedure for masking methods. 



2.1 Fuzzy Clustering 

Two fuzzy clustering methods are reviewed in this section: fuzzy c-means (FCM) 
and entropy-based fuzzy c-means (EFCM). Both methods are considered with 
two variations: with and without variable size. In the description, we will use 
the following notation: $x_k \in \mathbb{R}^p$ for $k = 1, \ldots, n$ stands for the 
elements (in a given p-dimensional space) that are to be clustered, and $u_{ik}$ is 
the membership of element $x_k$ to the i-th cluster. 

Fuzzy c-means is a fuzzy clustering method that generalizes c-means (also 
known as k-means). While c-means builds a crisp partition with c clusters, fuzzy 
c-means builds a fuzzy one (also with c clusters). Therefore, in the latter, 
elements are allowed to belong to more than one cluster. As said, $u_{ik}$ is used to 
formalize the membership of element $x_k$ to the i-th cluster. The crisp case 
corresponds to having $u_{ik}$ equal to either 0 or 1 (boolean membership), while the 
fuzzy case corresponds to having $u_{ik}$ in [0,1]. In this latter case, $u_{ik} = 0$ 
corresponds to non-membership and $u_{ik} = 1$ corresponds to full membership to 
cluster i. Values in between correspond to partial membership (the larger the 
value, the greater the membership). 
Formally speaking, fuzzy c-means is defined as the solution of the following 
minimization problem (see [2] or [19] for details): 

$$J_{FCM}(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^m \|x_k - v_i\|^2 \qquad (1)$$

subject to the constraints $u_{ik} \in [0, 1]$ and $\sum_{i=1}^{c} u_{ik} = 1$ for all k. 
We will denote the set of values u that satisfy these constraints by M. 

In this formulation, $v_i$ corresponds to the centroid (cluster center/cluster 
representative) of the i-th cluster and m is a parameter (m > 1) that plays a 
central role. With values of m near 1, solutions tend to be crisp (with the 
particular case that m = 1 corresponds to the crisp c-means). Instead, larger 
values of m yield clusters with increasing fuzziness in their boundaries. 

To solve this problem, an iterative process is applied. The method interleaves 
two steps: one that estimates the optimal membership functions of elements to 
clusters (when centroids are fixed) and another that estimates the centroids for 
each cluster (when membership functions are fixed). This iterative process is as 
follows: 

Step 1: Generate an initial U and V 
Step 2: Solve $\min_{U \in M} J(U, V)$ computing: 

$$u_{ik} = \left( \sum_{j=1}^{c} \left( \frac{\|x_k - v_i\|^2}{\|x_k - v_j\|^2} \right)^{\frac{1}{m-1}} \right)^{-1}$$

Step 3: Solve $\min_{V} J(U, V)$ computing: 

$$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m}$$

Step 4: If the solution does not converge, go to step 2; otherwise, stop 

This method is not guaranteed to find the optimal solution of the minimization 
problem given above, only a local optimum. Different starting points can lead to 
different solutions. 
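
The alternation between Step 2 and Step 3 can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions, not the code used in the experiments: the function names are hypothetical, the initial centers are supplied by the caller (Step 1), the standard fuzzy c-means update formulas are used, and convergence is tested on the movement of the centers.

```python
def sq_dist(x, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def update_memberships(points, centers, m):
    """Step 2: memberships u[i][k] that are optimal for fixed centers."""
    c, n = len(centers), len(points)
    u = [[0.0] * n for _ in range(c)]
    for k, x in enumerate(points):
        d = [sq_dist(x, v) for v in centers]
        zeros = [i for i in range(c) if d[i] == 0.0]
        if zeros:
            # x coincides with a center: full membership, shared on ties
            for i in zeros:
                u[i][k] = 1.0 / len(zeros)
            continue
        for i in range(c):
            u[i][k] = 1.0 / sum((d[i] / d[j]) ** (1.0 / (m - 1)) for j in range(c))
    return u


def update_centers(points, u, m):
    """Step 3: centers that are optimal for fixed memberships."""
    dim = len(points[0])
    centers = []
    for row in u:
        weights = [u_ik ** m for u_ik in row]
        total = sum(weights)  # assumed non-zero (no empty cluster)
        centers.append(tuple(
            sum(w * x[d] for w, x in zip(weights, points)) / total
            for d in range(dim)
        ))
    return centers


def fuzzy_c_means(points, centers, m=2.0, max_iter=100, tol=1e-12):
    """Alternate Steps 2 and 3 until the centers stop moving (Step 4)."""
    centers = [tuple(v) for v in centers]          # Step 1: initial V
    u = update_memberships(points, centers, m)     # initial U derived from V
    for _ in range(max_iter):
        new_centers = update_centers(points, u, m)
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        u = update_memberships(points, centers, m)
        if shift < tol:
            break
    return u, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
u, centers = fuzzy_c_means(points, [(0, 0), (10, 10)])
```

Starting from other initial centers may end in a different local optimum, which is why the initial centers are left as an explicit argument.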

Recently an alternative fuzzy clustering method was proposed [15] (see also 
[16]). It is the so-called entropy-based fuzzy c-means (EFCM). While fuzzy 
c-means introduces fuzziness in the solution by means of the parameter m, the 
entropy-based fuzzy c-means uses a term based on entropy and a parameter $\lambda$ 
($\lambda > 0$) to force a fuzzy solution. 

Formally speaking, entropy-based fuzzy c-means is defined as follows: 

$$J_{EFCM}(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} \left\{ u_{ik} \|x_k - v_i\|^2 + \lambda^{-1} u_{ik} \log u_{ik} \right\} \qquad (2)$$

subject to the constraints $u_{ik} \in [0, 1]$ and $\sum_{i=1}^{c} u_{ik} = 1$ for all k. 

Note that with this formulation, the smaller the $\lambda$, the fuzzier the solutions. 
Instead, when $\lambda$ tends to infinity, the second term becomes negligible and the 
algorithm yields a crisp solution. 

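EFCM is also solved using the iterative process described above. In this case, 
the expressions for $u_{ik}$ and $v_i$ in the algorithm above have to be replaced by the 
following computations: 

$$u_{ik} = \frac{e^{-\lambda \|x_k - v_i\|^2}}{\sum_{j=1}^{c} e^{-\lambda \|x_k - v_j\|^2}} \qquad (3)$$

$$v_i = \frac{\sum_{k=1}^{n} u_{ik} x_k}{\sum_{k=1}^{n} u_{ik}} \qquad (4)$$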

Since the expressions to minimize in (1) and (2) are different, the two methods 
lead to different solutions. In particular, one of the differences between the 
approaches is that the centroid of a cluster has a membership equal to one in FCM, 
while in EFCM it might have a membership less than one. Also, given the same set 
of centers, the shape of the clusters (the memberships) would be different due 
to the different expressions for $u_{ik}$. 
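
A short sketch of the entropy-based updates makes the difference concrete. It assumes the usual closed forms for EFCM, in which the membership of a point to cluster i is proportional to exp(-lam * squared distance to center i), with `lam` standing for the entropy parameter of expression (2); the function names are hypothetical.

```python
import math


def efcm_memberships(point, centers, lam):
    """Memberships of one point to every center (entropy-based update)."""
    d = [sum((a - b) ** 2 for a, b in zip(point, v)) for v in centers]
    d_min = min(d)  # shifting by the minimum avoids underflow, ratios are unchanged
    weights = [math.exp(-lam * (d_i - d_min)) for d_i in d]
    total = sum(weights)
    return [w / total for w in weights]


def efcm_centers(points, u):
    """Centers as membership-weighted means; u[k][i] is the membership of
    point k to cluster i (each cluster is assumed to have non-zero total)."""
    c, dim = len(u[0]), len(points[0])
    centers = []
    for i in range(c):
        total = sum(u[k][i] for k in range(len(points)))
        centers.append(tuple(
            sum(u[k][i] * points[k][d] for k in range(len(points))) / total
            for d in range(dim)
        ))
    return centers


centers = [(0.0, 0.0), (2.0, 0.0)]
at_center_fuzzy = efcm_memberships((0.0, 0.0), centers, lam=0.1)
at_center_crisp = efcm_memberships((0.0, 0.0), centers, lam=10.0)
at_midpoint = efcm_memberships((1.0, 0.0), centers, lam=0.1)
```

With a small `lam`, a point lying exactly on a center still gets a membership clearly below one, whereas a large `lam` drives the solution towards a crisp one.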

A variation of these clustering methods was proposed in [17] that introduces a 
size variable for each cluster. This parameter was introduced to reduce 
misclassification when there are clusters of different size. Otherwise, two adjacent 
clusters have equal membership (equal to 0.5) at the mid-point between the two 
centroids. The size of the i-th cluster is represented by the parameter $\alpha_i$ (the 
larger $\alpha_i$, the larger the proportion of elements that belong to the i-th 
cluster). A similar approach is given in [11]. 

The corresponding expressions to minimize for FCM and EFCM when 
variable-size parameters are considered are as follows: 

$$J_{FCM}(\alpha, U, V) = \sum_{i=1}^{c} \alpha_i \sum_{k=1}^{n} (\alpha_i^{-1} u_{ik})^m \|x_k - v_i\|^2$$

$$J_{EFCM}(\alpha, U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} \left\{ u_{ik} \|x_k - v_i\|^2 + \lambda^{-1} u_{ik} \log(\alpha_i^{-1} u_{ik}) \right\}$$

The same constraints on $u_{ik}$ given above apply here, together with a constraint 
on $\alpha$: $\sum_{i=1}^{c} \alpha_i = 1$ and $\alpha_i > 0$ for all i = 1, ..., c. 

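The solution of the minimization process for the variable-size problem follows 
the algorithm given above but with an additional step [Step 3.1] that 
computes an estimate for $\alpha$. In the case of FCM, $\alpha$ is estimated by: 

$$\alpha_i = \left[ \sum_{j=1}^{c} \left( \frac{\sum_{k=1}^{n} (u_{jk})^m \|x_k - v_j\|^2}{\sum_{k=1}^{n} (u_{ik})^m \|x_k - v_i\|^2} \right)^{\frac{1}{m}} \right]^{-1}$$

and in Step 2 the following expression is used to estimate $u_{ik}$: 

$$u_{ik} = \left[ \sum_{j=1}^{c} \frac{\alpha_j}{\alpha_i} \left( \frac{\|x_k - v_i\|^2}{\|x_k - v_j\|^2} \right)^{\frac{1}{m-1}} \right]^{-1}$$

The expression for $v_i$ in Step 3 is still valid. 

In the case of variable-size EFCM, the following two expressions are required 
(the expression for $v_i$ given above is still valid): 

$$\alpha_i = \frac{\sum_{k=1}^{n} u_{ik}}{n}$$

$$u_{ik} = \frac{\alpha_i e^{-\lambda \|x_k - v_i\|^2}}{\sum_{j=1}^{c} \alpha_j e^{-\lambda \|x_k - v_j\|^2}}$$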

2.2 Scoring Clustering Methods 

As said in the introduction, masking methods are ranked according to a trade-off 
between information loss and disclosure risk measures. For each masking method 
and each parameterization, a masked file is computed for a particular original 
file. Then, for the pair (original file, masked file) a score that averages the 
two measures is computed. Here, disclosure risk is measured as the number of 
records in a masked file that can be linked (e.g., using a distance-based record 
linkage algorithm) with the corresponding original record. On the other hand, 
information loss is measured using general indexes (not data-use specific) that 
compare the original data and the masked data (e.g., similarity between records, 
similarity between covariance matrices). See [6] for details of the approach. This 
approach has also been used in [8] and [25]. 

3 Distance-Based Re-identification for Microaggregation 

In this section, we study re-identification methods to show that the standard 
distance-based method (see e.g. [21]) underestimates disclosure risk for 
microaggregation. Although other approaches exist for re-identification (e.g., 
probabilistic-based [23,24,12] and clustering-based [1]), we focus on the 
distance-based approach because it was shown in [21] that for numerical data this 
is the method that performs better (i.e., it re-identifies more records). 

Given two files A and B containing records corresponding to the same 
individuals and described by the same variables, distance-based record linkage 
computes the distance between all pairs of records (one from file A and the other 
from file B) and assigns to each record in file A the one in B at a minimum distance. 
Therefore, for a record a in A, the record b in B with $d(a, b) = \min_{b' \in B} d(a, b')$ 
is selected. Here, d(x, y) is the Euclidean distance between records x and y. That 
is, 

$$d(x, y) = \sqrt{\sum_{v \in V} d_v(x_v, y_v)}$$

where $d_v(x_v, y_v)$ is the distance between the values of records x and y for variable 
v (V is the set of variables). Thus, for numerical data $d_v(x_v, y_v) = (x_v - y_v)^2$ 
and: 

$$d(x, y) = \sqrt{\sum_{v \in V} (x_v - y_v)^2}$$

Nevertheless, it can be shown that in the case of microaggregation, this is 
not the best option. This distance does not exploit prior knowledge on the sets 
of variables used for microaggregation. 

We now give an alternative definition of the distance that takes into account 
the sets of variables. As centroids are computed per set, given a partition S 
of the set of variables, an aggregation of per-set distances is considered. 
Accordingly, we define the following per-set distance for each set s in S: 

$$d_s(x, y) = \sqrt{\sum_{v \in s} d_v(x_v, y_v)}$$

Now, as said, the distance between two records is defined as the average (an 
aggregation) of the distances over all the sets (we will denote it by $d_S$): 

$$d_S(x, y) = \frac{\sum_{s \in S} d_s(x, y)}{|S|}$$

Combining both expressions we get: 

$$d_S(x, y) = \frac{1}{|S|} \sum_{s \in S} \sqrt{\sum_{v \in s} (x_v - y_v)^2}$$
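
The two distances and the linkage rule can be sketched as follows. This is a minimal sketch that assumes both distances take square roots as in the Euclidean definition (the per-set distance being the Euclidean distance restricted to one set of variables); the function names and the toy records are hypothetical.

```python
import math


def d(x, y):
    """Euclidean distance over all variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def d_S(x, y, partition):
    """Average of the per-set Euclidean distances.

    partition -- list of lists of variable indices (the sets used
                 for microaggregation)
    """
    per_set = [
        math.sqrt(sum((x[v] - y[v]) ** 2 for v in s)) for s in partition
    ]
    return sum(per_set) / len(per_set)


def link(file_a, file_b, dist):
    """For each record of file_a, index of the closest record of file_b."""
    return [
        min(range(len(file_b)), key=lambda j: dist(a, file_b[j]))
        for a in file_a
    ]


partition = [[0, 1], [2, 3]]
file_a = [(0, 0, 0, 0)]
file_b = [(3, 0, 0, 0), (2, 0, 2, 0)]

links_d = link(file_a, file_b, d)
links_dS = link(file_a, file_b, lambda x, y: d_S(x, y, partition))
```

In this toy case the two distances select different records: the first candidate differs from the query in one set of variables only, which the per-set average rewards, while the plain Euclidean distance prefers the second candidate.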

We have compared the performance of distance-based record linkage using 
both distances (d and $d_S$) for the outcome of the best parameterization of 
microaggregation following [6]. This corresponds to a microaggregation with sets of 
3 variables and with k = 7. This microaggregation has been applied to data from 
the American Housing Survey 1993 (U. S. Census Bureau). See [6] for details. 

Re-identification experiments have been undertaken considering first only the 
first variable in both files, then the first and the second, then the first to the 
third, and so on until all variables were considered. Table 2 gives the number and 
proportion of re-identified records for both distances d and $d_S$. 

The table shows that the distance $d_S$ behaves better: in about half of the 
settings (those using 7 or more variables) the number of re-identifications 
increases by at least 23%. Note that for 1, 2 and 3 variables both approaches 
yield the same value. This is so because, as the first three variables are 
microaggregated together, both distances are equivalent (in relation to the 
computation of the minimum distance). 
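
The 23% figure can be checked directly from the counts reported in Table 2. The snippet below only re-does that arithmetic for the experiments with 7 to 13 variables; the variable names are arbitrary.

```python
# Numbers of re-identified records from Table 2, using 7 to 13 variables
n_vars = list(range(7, 14))
linked_dS = [347, 461, 487, 599, 600, 636, 669]
linked_d = [282, 366, 386, 441, 443, 472, 497]

# Relative gain of d_S over d for each number of variables
gains = [a / b - 1 for a, b in zip(linked_dS, linked_d)]
for n, gain in zip(n_vars, gains):
    print(f"{n:2d} variables: d_S links {100 * gain:.1f}% more records than d")

smallest_gain = min(gains)
```

The smallest relative gain is obtained with 7 variables (347 against 282 records).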



4 Fuzzy c-Means Based Microaggregation 

[8] proposed the so-called “Adaptative at least k fuzzy c-means” for building a 
fuzzy partition where all the clusters have at least k elements. The approach is 
to apply fuzzy clustering with a particular parameterization (c and m in FCM). 
Then, if the partition leads to clusters with fewer than k records, the parameters c 
and m are reconsidered and the fuzzy clustering is applied again. Finally, when 
all the clusters have at least k elements, the “Adaptative at least k fuzzy c-means” 
algorithm is applied to all those clusters with more than 2k records. 
Table 2. Number (top) and proportion (bottom) of re-identified records using 
distance-based record linkage for distances d and d_S, using from 1 to 13 variables 
in the re-identification process. 

      1      2      3      4      5      6      7     8     9     10    11    12    13 
d_S   1      3      3      109    226    237    347   461   487   599   600   636   669 
d     1      3      3      107    216    224    282   366   386   441   443   472   497 
d_S   0.001  0.003  0.003  0.100  0.209  0.219  0.32  0.42  0.45  0.55  0.55  0.58  0.61 
d     0.001  0.003  0.003  0.099  0.200  0.207  0.26  0.33  0.35  0.40  0.41  0.43  0.46 



Once clusters are built, the values of a record are replaced by the ones of its 
cluster center (this is the so-called assignment process). Nevertheless, a record 
can belong to several clusters and several approaches can be considered for 
cluster selection. We have considered the following two random assignment processes: 

(a) All clusters with non-null membership can be selected with the same 
probability. 

(b) Clusters are selected with a probability proportional to the membership. 

For each record, this process has to be applied to each variable. The following 
two alternatives have been considered: 

(i) Once a cluster is selected, all variables are masked using the same cluster 
center. 

(ii) Cluster selection is applied to each variable. Therefore, different variables 
can use different centers for their masking. 

These alternatives lead to four approaches for replacing the original data by 
the masked data. 
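
The four combinations can be sketched with two flags. This is an illustrative sketch only: the function names and the flags `proportional` (option (b) instead of (a)) and `per_variable` (option (ii) instead of (i)) are hypothetical, and the memberships of the record are assumed to be already available from the clustering step.

```python
import random


def pick_cluster(memberships, proportional, rng):
    """Select a cluster index for one record.

    proportional=False -- option (a): uniform over clusters with
                          non-null membership
    proportional=True  -- option (b): probability proportional to
                          the membership
    """
    candidates = [i for i, u in enumerate(memberships) if u > 0]
    if proportional:
        weights = [memberships[i] for i in candidates]
    else:
        weights = [1.0] * len(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


def mask_record(memberships, centers, proportional, per_variable, rng):
    """Replace a record by values taken from cluster centers.

    per_variable=False -- option (i): one cluster for the whole record
    per_variable=True  -- option (ii): a new selection for each variable
    """
    dim = len(centers[0])
    if not per_variable:
        i = pick_cluster(memberships, proportional, rng)
        return tuple(centers[i])
    return tuple(
        centers[pick_cluster(memberships, proportional, rng)][d]
        for d in range(dim)
    )


rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
crisp = mask_record([0.0, 1.0, 0.0], centers, True, False, rng)
mixed = mask_record([0.5, 0.5, 0.0], centers, True, True, rng)
```

With option (ii) a masked record may combine coordinates of different centers, so it need not coincide with any single cluster center.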



4.1 Experiments 

We have tested our approach using the numerical data described in [6] (consisting 
of 1080 records and 13 variables). This is data from the American Housing 
Survey 1993 (U. S. Census Bureau), publicly available at [4]. We have tested 
our approach using several parameterizations to compare the performance of 
the fuzzy clustering methods described in Section 2. 

For each clustering method and parameterization, the method was applied 10 
times and the average score and the corresponding deviation were computed. This 
was done to avoid differences due to random elements in the method. Table 3 gives 
the score and the deviation of the 10 best parameterizations obtained. It can 
be observed that the best parameterizations yield scores of about 34, the best 
one being 33.76 (first row). The best solutions correspond to entropy-based fuzzy 
clustering with assignment method (b). Of all, the best individual score (not 
considering the average of the 10 executions with the same parameters) is 30.05. 
This corresponds to entropy-based fuzzy clustering with b.i for the assignment. 

As an example, for one of the parameterizations (with score 33.63), 
we got an information loss of 32.23 and disclosure risk measures of 15.59 
(percentage of re-identified records) and 54.46 (interval disclosure measure). Instead, 
the best microaggregation using the standard procedure leads to a score of 28.45 
with an information loss of 11.06 and disclosure risk measures of 19.35 and 72.34. 
Such information loss measures have been computed using standard procedures 
for. 
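
The figures above are consistent with a score that averages the information loss with the mean of the two disclosure risk measures. The weighting used below is an assumption inferred from these numbers, and the function name is hypothetical; see [6] for the actual definition of the score.

```python
def score(information_loss, linked_records_pct, interval_disclosure_pct):
    """Assumed score: 0.5 * information loss + 0.5 * disclosure risk,
    where the disclosure risk is the mean of the two risk measures."""
    disclosure_risk = 0.5 * linked_records_pct + 0.5 * interval_disclosure_pct
    return 0.5 * information_loss + 0.5 * disclosure_risk


fuzzy_score = score(32.23, 15.59, 54.46)     # fuzzy microaggregation example
standard_score = score(11.06, 19.35, 72.34)  # best standard microaggregation
print(round(fuzzy_score, 2), round(standard_score, 2))
```

Under this assumption, the fuzzy approach trades a higher information loss for lower values of both disclosure risk measures.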



Table 3. Parameters and scores for the 10 best mean scores. Here, c is the number of 
variables considered at a time; k is the minimum number of records allowed; ass. meth. 
and clust. meth. are, respectively, the assignment approach and the fuzzy clustering 
method used. Columns 5-14 are the scores obtained for 10 different executions and 
then the minimum, maximum, mean and deviation are given. The last row includes 
the parameterization that has obtained the best score (but not the best mean score). 

c  k   ass.  clust.    Min    Max    Mean   Dev 
7  9   b.ii  Var-EFCM  31.84  37.18  33.76  1.90 
7  8   b.ii  EFCM      31.83  35.35  33.86  1.35 
7  6   b.i   EFCM      31.68  36.96  34.31  1.70 
7  7   b.i   EFCM      31.46  37.92  34.33  2.34 
7  10  b.ii  EFCM      31.69  37.41  34.42  1.67 
7  6   b.ii  EFCM      32.55  36.02  34.45  1.12 
7  7   b.ii  EFCM      31.45  43.90  34.48  3.48 
7  8   b.i   Var-EFCM  32.79  37.57  34.54  1.81 
6  8   b.ii  EFCM      33.11  36.54  34.63  1.09 
6  10  b.ii  Var-EFCM  32.12  38.03  34.64  1.88 
7  9   b.i   EFCM      30.05  39.23  35.81  2.73 



To compare the effect of some of the parameters, averages of scores and 
deviations have been computed for each clustering method, each assignment 
approach (i.e., the method for masking a particular record), each number of 
variables considered and each value for k. These averages and deviations make 
it possible to rank clustering methods and some of their parameters. 

Entropy-based clustering algorithms are shown to be significantly better than 
standard fuzzy c-means based clustering algorithms. This is supported by the 
results in Table 4. For each clustering method, this table gives the average score 
of all considered parameterizations. It can be observed that the average score for 
FCM is about 50, while for EFCM it is about 40. No significant difference is obtained 
between the variable-size and non-variable-size variants. The deviation of the 
score is smaller for entropy-based methods. This means that the entropy-based 
approach is more stable. 

Table 5 shows that the best method for the aggregation step is the one based 
on a probability distribution proportional to the membership degrees. There is 
no preference in relation to approaches (i) and (ii), as they lead to almost the 
same scores. 
Table 4. Mean score and deviation for the clustering algorithms considered. 

clust. meth.   mean score  deviation 
FCM            50.95       7.03 
Var-size FCM   49.77       6.70 
EFCM           41.27       3.37 
Var-size EFCM  41.68       3.59 



Table 5. Mean score and deviation for the assignment approaches considered. 

cluster assignment  mean score  deviation 
(a) (i)             47.94       7.12 
(a) (ii)            48.16       7.11 
(b) (i)             43.76       3.24 
(b) (ii)            43.80       3.22 



Table 6. Mean score and deviation when the number of variables is between 1 and 9. 

number of variables  mean score  deviation 
1                    47.62       6.08 
2                    49.66       5.28 
3                    45.63       5.15 
4                    45.28       4.89 
5                    45.32       5.25 
6                    43.49       4.68 
7                    43.07       4.68 
8                    44.52       4.94 
9                    47.90       5.42 



Table 7. Mean score and deviation for k in the range 3-10. 

k value  mean score  deviation 
3        47.98       5.99 
4        46.84       5.46 
5        46.42       5.30 
6        45.81       5.20 
7        45.63       4.87 
8        45.48       5.14 
9        44.35       4.32 
10       44.97       5.17 



In Table 6, the average scores for different numbers of variables are given. The 
table shows a weak preference for a large number of variables (7 being the best 
parameterization). The value for k, the minimum number of records in a cluster, 
has a similar behavior. The best parameterization is with k near 9. This is 
shown in Table 7. 
5 Conclusions and Future Work 

In this work, we have studied and compared the use of fuzzy clustering methods 
for microdata protection. We have observed that entropy-based fuzzy c-means is 
the most promising approach. 

Our approach has been evaluated using standard scoring approaches based 
on distance-based record linkage. We have shown that this approach leads to a 
measure that underestimates the risk. Nevertheless, we have used this measure 
to make comparison with other studies possible. 

Under this framework, our approach leads to similar results (28.45 is the best 
microaggregation score in [6] and 30.05 is the best score here) while avoiding 
a clear identification of the clusters. 

As future work, we consider a more detailed study of the score with the 
inclusion of disclosure risk measures specific to the masking method. Also, we 
consider the use of fuzzy clustering algorithms based on kernel functions (see 
e.g., [18], [20]). 
Abstract. Statistical disclosure limitation is widely used by data collecting 
institutions to provide safe individual data. However, the choice 
of the disclosure limitation method severely affects the quality of the 
data and limits their use for empirical research. In particular, estimators 
for nonlinear models based on data which are masked by standard disclosure 
limitation techniques such as blanking or noise addition lead to 
inconsistent parameter estimates. This paper investigates to what extent 
appropriate econometric techniques can obtain parameter estimates 
of the true data generating process, if the data are masked by noise addition 
or blanking. Comparing three different estimators - the calibration 
method, the SIMEX method and a semiparametric sample selectivity 
estimator - we produce Monte-Carlo evidence on how the reduction of 
data quality caused by masking can be minimized. 

Keywords: disclosure limitation, blanking, semi-parametric selection 
models, errors in variables in nonlinear models 



1 Introduction 

Over the last decades empirical research in the social sciences has shown an 
increasing interest in the analysis of microdata. While the focus was originally 
more on individual and household data, a growing share of social scientists also 
use firm-level data. Both types of data - individual and household data as 
well as firm-level data - contain sensitive information on the observational unit, 
whose confidentiality has to be protected against disclosure in the interest of 
the observational unit, but also in the interest of the data collecting institution. 
Obviously, for the data collecting institution, this creates a trade-off between the 
goal of guaranteeing confidentiality and the goal of providing the maximum 
amount of information to the researcher. Private and public data collecting 
institutions have therefore become interested in the provision of scientific-use 
files that optimally combine the interests of the survey respondents, the data 
collecting institution and academic users. 
In order to minimize the probability of disclosing individual information, 
statistical offices and other data collecting institutions apply various masking 
procedures. They differ in the probability of reidentification of the original in- 
formation and in their effects on the efficiency and consistency of econometric 
estimators. In general a masking procedure simply represents a data filter, that 
transforms the true data generating process. Therefore, the primary question 
of the empirical researcher is whether the true data generating process can be 
reidentified despite the filtration through the masking procedure. Finally, disclo- 
sure limitation confronts the empirical researcher with the fundamental question 
about the extent of the information loss by disclosure limitation and whether 
consistent estimation is possible at all. Even if disclosure limitation does not lead 
to inconsistent parameter estimates by the use of appropriate estimation tech- 
niques, the question about the relevance of the empirical findings remains open. 
Causalities or correlations between covariates that are found to be insignificant 
on the basis of the masked data, may simply be the result of an efficiency loss. 

This paper takes a closer look at the consequences of masking procedures for
the performance of nonlinear econometric estimators. The effects of some
masking procedures, such as listwise microaggregation, addition of independent
noise and aggregation of qualitative information, on the properties of estimators
are well understood for the linear model,^1 where possible cures can easily
be implemented to restore consistency of the parameter estimates. Nonlinear
regression techniques that cope with the implications of data masking, by
contrast, are still in their infancy.

One popular method of masking microdata is to add independent noise to 
the covariates. For the linear regression model the effects of measurement errors 
are well understood and discussed in the literature on errors-in-variables (EIV) 
models (e.g. [Ful87]). Literature on measurement errors in nonlinear models is 
comparatively limited. Special aspects are treated by [Ame85], [HNP95], [LS95], 
and [HT02]. The monograph by [CRS95] surveys various approaches to errors- 
in-variables in nonlinear models. 

In this paper we analyze the performance of the calibration method proposed
by [CS90] and the simulation-extrapolation method (SIMEX) by [CS94] for
nonlinear models with measurement errors. Although originally developed for the
general case of errors in variables in nonlinear regression settings, these methods
turn out to be particularly suited for the case of data masking through noise
addition. We show how these methods have to be modified for application to
data that are masked by the addition of independent noise as a means of statistical
disclosure limitation.

Data blanking is an alternative, easy-to-implement method of disclosure
limitation. Because observations in the tails of the sampling distribution carry the
highest risk of disclosure (large firms, firms with only a few competitors in their
sector etc.), observations in lower or higher quantiles are blanked. This leads to
sample selectivity problems if the blanking rule is related to the dependent
variable. We therefore consider the use of a selectivity approach that can also be
applied to nonlinear regression models. The estimator we propose consists of the
Klein-Spady semiparametric estimator for binary choice models ([KS93]) for the
first estimation stage combined with the series estimator proposed by [New99]
for the second stage. For two reasons this approach seems to be superior to
imputation methods. First, imputation of the blanked values, if it is precise, may
again create a high risk of disclosure. Secondly, the data generating process of
a nonlinear regression model in which a subset of the observations is
imputed can hardly be modelled correctly.

^1 See, for instance, [LP03], who compare various approaches taking into account
masking by listwise microaggregation or noise addition.

The remainder of the paper is organized as follows. In Section 2 we introduce
the calibration method and the SIMEX method for the general measurement 
error problem in nonlinear models. We adjust the calibration method originally 
designed for the case of replicated data to the case of data perturbation by 
measurement errors. Moreover, we show that the SIMEX method is particularly 
suitable for the case of disclosure limitation by data perturbation. Using a Monte- 
Carlo design for data which are masked by the addition of independent noise, we 
investigate the small sample properties of the two estimators applied to a count 
data example. Section 3 deals with the possibility of estimating the parameters 
of the true data generating process if the data are masked by blanking. Based on
a Monte-Carlo design for blanked data we assess the performance of the
two-step selection estimator if a subset of the data is blanked. Section 4 concludes
and gives an outlook on future research.

2 Masking by Noise Addition 

A simple method to protect data against disclosure is to add independent noise 
to the covariates. This leads to a standard errors-in-variables problem. However, 
contrary to the case of non-experimental measurement errors more information 
on the error process, such as information on the moments or the distribution, 
can be made available to the researcher without significantly increasing the risk 
of disclosure. This additional information eases the implementation of nonlinear 
estimators accounting for measurement errors. 

Let us consider the following nonlinear regression model:

$$Y_i = m(X_i, \beta) + \varepsilon_i \qquad (1)$$

where $Y_i$ is the dependent variable, $X_i$ is a $k$-dimensional vector of unobservable
explanatory variables, and $\varepsilon_i$ an independent error term with mean
$E[\varepsilon_i | X_i] = 0$ and $V[\varepsilon_i | X_i] = \sigma^2(X_i)$. The true $X$'s are masked by an
error, so that we can only observe the vector $W_i$. Here we will focus on the case
of additive measurement errors

$$W_i = X_i + U_i, \qquad (2)$$

where $U_i$ is a vector of independent random variables with $E[U_i | X_i] = 0$ and
$V[U_i | X_i] = \Sigma_{uu}$. Since, in the presence of measurement errors, the error term
and the regressors are correlated, the OLS estimator for $\beta$ in the linear model is
inconsistent. In general, this also holds for nonlinear regression models.

Regression Calibration. The basic idea of regression calibration is closely
related to two-stage least squares regression. As in two-stage least squares, the
vector of explanatory variables $X_i$, which includes sensitive information on individual
$i$, is replaced by an estimate of the orthogonal projection of $X$ on $W$, $P(X|W)$.
After this approximation, a standard analysis is performed. This method was
first studied in the context of proportional hazard regressions ([Pre82]) and
extended to more general classes of regression models ([CS90], [RWS89]). A detailed
discussion of regression calibration can be found in [CRS95].

Contrary to two-stage least squares, a regression of $X$ on $W$ is not feasible
since $X$ is not available to the researcher. In the case of replication data, e.g.
panel information on $W$ or replications of $W$, the components of the linear
predictor can be estimated without knowledge of $X$. If instruments $Z$ are available
to predict $X$, then an orthogonal projection of $W$ on $Z$, $P(W|Z)$, can be used
instead of $P(X|Z)$, since the estimates of the two orthogonal projectors are the
same. Then the unobserved $X$ is replaced by its estimate from the calibration
model in a standard analysis to obtain parameter estimates. Finally, the
standard errors are adjusted to account for the estimation of the unknown covariates
using either the bootstrap or the sandwich method.

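For replication data, the estimate of the best linear prediction (BLP) can be
obtained as follows. Suppose that there are $k_i$ replicate measurements $W_{ij}$ of $X_i$, and
that $\bar{W}_{i\cdot}$ is their mean. With replication data, the measurement error variance
can be estimated by

$$\hat{\Sigma}_{uu} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{k_i} (W_{ij} - \bar{W}_{i\cdot})(W_{ij} - \bar{W}_{i\cdot})'}{\sum_{i=1}^{n} (k_i - 1)}.$$

We know that the BLP of $X$ given $W$ is

$$P[X|W] = E[X] + Cov[W, X]\, V[W]^{-1} (W - E[W]),$$

which can be estimated as

$$\hat{P}[X|W] = \hat{\mu}_x + \hat{\Sigma}_{wx} \hat{\Sigma}_{ww}^{-1} (W - \hat{\mu}_w).$$

In order to compute the estimate of the BLP, we use for the mean of the unknown
covariates the mean of the replicates ($\hat{\mu}_x = \hat{\mu}_w$). Moreover, we estimate the
variance-covariance matrix by

$$\hat{\Sigma}_{wx} = \hat{\Sigma}_{xx} = \frac{1}{\nu} \sum_{i=1}^{n} k_i (\bar{W}_{i\cdot} - \hat{\mu}_w)(\bar{W}_{i\cdot} - \hat{\mu}_w)' - \frac{n-1}{\nu} \hat{\Sigma}_{uu},$$

where $\nu$ is a normalizing constant that depends on the numbers of replicates $k_i$.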
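The replicate-based calibration step can be illustrated with a minimal Python sketch for a scalar covariate with the same number of replicates for every unit. The function name `regression_calibration` and the simulated data are illustrative assumptions, not the authors' implementation. With equal replicate counts, the variance of the replicate means equals the variance of $X$ plus the error variance divided by $k$, which is the moment condition the sketch uses.

```python
import numpy as np

def regression_calibration(W):
    """Calibrated predictor of a scalar X from an (n, k) array of replicates.

    Assumes W[i, j] = X[i] + U[i, j] with independent errors and the same
    number k >= 2 of replicates for every unit (an illustrative assumption).
    """
    n, k = W.shape
    w_bar = W.mean(axis=1)                      # replicate means
    # Pooled within-unit variance estimates the measurement error variance.
    sigma_uu = ((W - w_bar[:, None]) ** 2).sum() / (n * (k - 1))
    mu = w_bar.mean()                           # estimate of E[X] = E[W]
    # Var(w_bar) = sigma_xx + sigma_uu / k, so subtract the noise share.
    sigma_xx = w_bar.var(ddof=1) - sigma_uu / k
    shrink = sigma_xx / (sigma_xx + sigma_uu / k)
    x_hat = mu + shrink * (w_bar - mu)          # best linear predictor of X
    return x_hat, sigma_uu, sigma_xx

# Simulated example: true X ~ N(0, 1), two replicates with error variance 0.25.
rng = np.random.default_rng(0)
x = rng.standard_normal(20000)
W = x[:, None] + 0.5 * rng.standard_normal((20000, 2))
x_hat, sigma_uu, sigma_xx = regression_calibration(W)
```

The calibrated values `x_hat` would then replace the unobserved covariate in the standard analysis; the standard errors of that analysis still need the bootstrap or sandwich adjustment mentioned above.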

SIMEX. Alternatively we consider the SIMEX method, which is well suited
to the problem of additive measurement errors. SIMEX is a simulation-based
method of estimating and reducing the bias due to measurement error. This
method rests on the observation that the bias in the parameter estimates varies
in a systematic way with the magnitude of the measurement error. It exploits
the information from an incremental addition of measurement errors to the $W$'s
using computer simulated random errors. The computation of the parameter
estimates for different noise levels shows in which direction
the measurement error affects the bias of the parameter estimates. This stage is
called the simulation step. The extrapolation step models the relation between
the parameter estimates and the magnitude of the measurement errors. In a final
step, this relation is extrapolated to the case of no measurement errors.

Suppose that the measurement error is additive with variance $\sigma_u^2$,

$$W_i = X_i + U_i,$$

and

$$W_{i,b}(\theta) = W_i + \sqrt{\theta}\, \sigma_u\, \nu_{i,b}, \qquad b = 1, \dots, B,$$

where $\theta > 0$, and $\{\nu_{i,b}\}_{b=1}^{B}$ are iid computer simulated standard normal
variates. Note that this simulation step creates additional datasets with an
increasingly large measurement error variance in the observable, which is equal to
$(1 + \theta)\sigma_u^2$. Let $\hat{\beta}_b(\theta_j)$ denote the vector of parameter estimates obtained by
regression of $Y$ on $\{W_b(\theta_j)\}$ for $0 \le \theta_0 < \theta_1 < \dots < \theta_J$. The value $\theta_J = 2$ is
recommended ([CRS95]).

Given the $B$ bootstrap estimates for each $\theta_j$ we can compute an average
estimate $\hat{\beta}(\theta_j) = B^{-1} \sum_{b=1}^{B} \hat{\beta}_b(\theta_j)$. This is the simulation step of SIMEX. Each
component of the vector $\hat{\beta}(\theta_j)$ is then modelled as a function of $\theta$ for $\theta \ge 0$,
and the SIMEX estimator is the extrapolation of $\hat{\beta}(\theta)$ to $\theta = -1$, which
represents a measurement error variance of zero.

We follow the suggestion of [CS94] and use a quadratic function as our
extrapolation function:

$$\hat{\beta}(\theta) = \alpha_1 + \alpha_2 \theta + \alpha_3 \theta^2.$$

This gives a system of equations which is solved for $\alpha_1$, $\alpha_2$ and $\alpha_3$, and finally
evaluated at the point $\theta = -1$ to obtain the SIMEX estimator.

This method can be implemented in practice without additional expenditure
for the data collecting institutions, and without an increase in the disclosure risk.
The data collecting institutions only have to provide the variance-covariance
matrix $\Sigma_{uu}$ of the measurement errors to the data user. Given that the disclosure
risk does not increase if the measurement error terms are uncorrelated, uncorrelatedness
can be assumed, so that information about the variances of the error
terms is sufficient to obtain a consistent estimate of the parameters of interest.
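The simulation and extrapolation steps can be sketched in Python for a Poisson regression with a single masked regressor. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `poisson_mle` and `simex_poisson`, the Newton-Raphson fit, the simulated data and the choice of 20 simulated samples per noise level are all hypothetical. Only the error standard deviation `sigma_u` has to be known to the data user.

```python
import numpy as np

def poisson_mle(X, y, max_iter=100, tol=1e-10):
    """Poisson ML with log link via Newton-Raphson; X must contain a constant."""
    # Start from an OLS fit of log(y + 0.5) to keep the first steps stable.
    beta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)
        hessian = X.T @ (X * mu[:, None])
        step = np.linalg.solve(hessian, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def simex_poisson(w, y, sigma_u, thetas=(0.0, 0.5, 1.0, 1.5, 2.0), B=50, seed=0):
    """Return (naive, simex) estimates of (intercept, slope) for a noisy regressor w."""
    rng = np.random.default_rng(seed)
    n = len(y)
    averages = []
    for theta in thetas:
        reps = B if theta > 0 else 1          # theta = 0 is the naive fit
        fits = []
        for _ in range(reps):
            # Simulation step: error variance becomes (1 + theta) * sigma_u**2.
            w_b = w + np.sqrt(theta) * sigma_u * rng.standard_normal(n)
            fits.append(poisson_mle(np.column_stack([np.ones(n), w_b]), y))
        averages.append(np.mean(fits, axis=0))
    averages = np.array(averages)             # shape (len(thetas), 2)
    # Extrapolation step: quadratic in theta per coefficient, evaluated at -1.
    a3, a2, a1 = np.polyfit(np.asarray(thetas), averages, deg=2)
    return averages[0], a1 - a2 + a3

# Hypothetical data: true slope 1, regressor masked with noise of variance 0.25.
rng = np.random.default_rng(1)
n = 5000
x = rng.standard_normal(n)
y = rng.poisson(np.exp(0.5 + 1.0 * x))
w = x + 0.5 * rng.standard_normal(n)
naive, corrected = simex_poisson(w, y, sigma_u=0.5, B=20)
```

In this toy design the naive slope is attenuated towards zero and the extrapolated slope moves back towards the true value; because the quadratic is only an approximation to the true bias curve, some residual bias generally remains.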



Monte Carlo Evidence. The Monte Carlo experiment illustrates the quantitative
effect of measurement errors in nonlinear models. We generate a Poisson
distributed dependent variable with true conditional mean function

$$\lambda = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2).$$

Moreover, we suppose that the two explanatory variables $X_1$ and $X_2$ depend on
a linear combination of multivariate normally distributed validation data $Z$:

$$X_1 = \gamma_{01} + \gamma_{11} Z_1 + \gamma_{12} Z_2 + \gamma_{13} Z_3 + \varepsilon_1$$
$$X_2 = \gamma_{02} + \gamma_{21} Z_1 + \gamma_{22} Z_2 + \gamma_{23} Z_3 + \varepsilon_2$$

We introduce the correlation between the $X$-variables and the $Z$-variables
for two reasons. First, it creates more variability in the explanatory variables.
Secondly, the $Z$'s can serve as instruments for the calibration method. To protect
the data against disclosure we disturb both explanatory variables $X_1$ and $X_2$ by
an additive error $u$, which is iid normally distributed with mean 0 and variance
equal to 0.25. The error terms $\varepsilon$ and the measurement error $u$ follow a trivariate
normal distribution:

$$\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ u \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0.4 & 0 & 0 \\ 0 & 0.4 & 0 \\ 0 & 0 & 0.25 \end{pmatrix} \right)$$

For each simulation we suppose that the true values for $\beta$ are $\beta_0 = 0.5$, $\beta_1 = 1$
and $\beta_2 = -1$. For $\gamma$ the true values are
$\gamma_{01} = \gamma_{11} = \gamma_{12} = \gamma_{13} = \gamma_{02} = \gamma_{21} = \gamma_{22} = \gamma_{23} = 1$.
Our Monte-Carlo results are based on three different sample
sizes, $N = 100$, $N = 1000$ and $N = 3500$, each replicated $R = 1000$ times.

For the SIMEX approach, we suppose that the $\theta_j$'s are equidistant in the
interval $[0, 2]$, so that $\theta_0 = 0 < \theta_1 = 0.5 < \theta_2 = 1 < \theta_3 = 1.5 < \theta_4 = 2$, and
generate $B = 50$ samples for each value of $\theta$.

Table 1 contains the results of the Monte Carlo simulations for the sample
sizes $N = 100$, 1000 and 3500, for the regression calibration estimator ($\hat{\beta}_{RCAL}$)
and for the SIMEX estimator ($\hat{\beta}_{SIMEX}$). In order to have a benchmark for how
strongly the measurement errors bias the estimates, we also report the Poisson
ML estimates $\hat{\beta}_{naive}$ on the mismeasured data and the Poisson ML estimates
$\hat{\beta}$ for the undisclosed explanatory variables. The Poisson estimates on the original
data provide insights into how close our estimates come to the best case scenario.
For the small sample size $N = 100$ the bias is somewhat larger than the bias of
the ML estimates when the EIV problem is ignored. However, the bias decreases
significantly with larger sample sizes. We can note that the bias of the regression
calibration estimator is much higher than that of the SIMEX estimator. This is
the case for all three sample sizes. For the SIMEX estimator, the root mean
squared error is not far from that of the corresponding ML estimator on the
original data, but this does not hold for the regression calibration estimator.

The relative standard error, RELSE, is defined as the ratio of the estimated
standard error of the estimator, averaged over the completed MC replications,
to the empirical standard deviation of the estimator across replications. When the
number of replications tends to infinity, the standard deviation of the estimates
across replications converges to the true standard error for a finite $N$. A deviation
of RELSE from 1 therefore provides
information about the accuracy of the estimation of the standard error based on
the asymptotic distribution. In small samples the standard errors of the estimates
cannot be estimated with great precision for the regression calibration estimator
and for the SIMEX estimator. But with a medium sample size ($N = 1000$) the
relative estimated standard error comes close to its desired value 1.
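The summary statistics reported in the tables (mean, bias, RMSE and RELSE) can be computed from the replication output as in the following Python sketch. The function name `mc_summary` and the toy experiment (the sample mean of normal data, whose standard error is known) are illustrative assumptions chosen only to show the calculation.

```python
import numpy as np

def mc_summary(estimates, std_errors, true_value):
    """Mean, bias, RMSE and RELSE over Monte Carlo replications."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = estimates.mean() - true_value
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    # RELSE: average estimated standard error relative to the empirical
    # standard deviation of the estimates across replications.
    relse = std_errors.mean() / estimates.std(ddof=1)
    return {"mean": estimates.mean(), "bias": bias, "rmse": rmse, "relse": relse}

# Toy experiment: the sample mean of N(1, 1) data, R = 2000 replications.
rng = np.random.default_rng(0)
R, n = 2000, 50
samples = rng.normal(loc=1.0, scale=1.0, size=(R, n))
est = samples.mean(axis=1)
se = samples.std(axis=1, ddof=1) / np.sqrt(n)
summary = mc_summary(est, se, true_value=1.0)
```

A RELSE well below 1 would indicate that the estimated standard errors understate the actual sampling variability, and a value well above 1 that they overstate it.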

Overall we can conclude that for moderate and larger sample sizes the quality
of the SIMEX estimates comes close to the best case estimates for the original
data. The reliability ratio $\rho = \sigma_x^2 / (\sigma_x^2 + \sigma_u^2)$ is equal to 0.93 in this case. In
order to compare the consistency of the methods, we will reestimate these models
by introducing a smaller value for the reliability ratio in each new Monte Carlo
experiment. This means that we will consider measurement error models with
increasing values of the variance of the measurement error. This will be done in a
future version of this paper.
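As a rough plausibility check of the reported value, assume that the reliability ratio is the share of the variance of the true regressor in the variance of the masked regressor, and that $Z_1$, $Z_2$ and $Z_3$ are mutually independent with unit variance (the text does not state the distribution of $Z$, so this is only an assumption). With all $\gamma$'s equal to 1, the design would then give:

```latex
\begin{align*}
\sigma_x^2 &= \gamma_{11}^2 + \gamma_{12}^2 + \gamma_{13}^2 + V[\varepsilon_1] = 3 + 0.4 = 3.4, \\
\rho &= \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2} = \frac{3.4}{3.4 + 0.25} \approx 0.93 .
\end{align*}
```

Under these assumptions the masking noise accounts for only about 7 percent of the variance of the observed regressor.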

Table 1. MC-Results for the Estimation of the Measurement Error Model. Estimates refer
to the coefficient on $X_1$, number of replications = 1000.

$\beta_1 = 1$                 mean     bias     RMSE    RELSE

N = 100
$\hat{\beta}$                1.003     .003     .081     .907
$\hat{\beta}_{naive}$        1.004     .004     .085     .906
$\hat{\beta}_{RCAL}$          .847    -.153     .507     .827
$\hat{\beta}_{SIMEX}$         .946    -.054     .096    1.35

N = 1000
$\hat{\beta}$                 .999    -.001     .024     .961
$\hat{\beta}_{naive}$         .998    -.002     .025     .954
$\hat{\beta}_{RCAL}$          .940    -.060     .461    1.082
$\hat{\beta}_{SIMEX}$         .991    -.009     .092    1.031

N = 3500
$\hat{\beta}$                1.000     .000     .012     .998
$\hat{\beta}_{naive}$        1.001     .001     .015     .996
$\hat{\beta}_{RCAL}$          .987    -.013     .450    1.052
$\hat{\beta}_{SIMEX}$        1.002     .002     .012    1.021


3 Masking by Blanking 

In the following we propose an approach that can be applied to data that are
masked by blanking. Blanking as a means of disclosure limitation rests on the
idea that covariates of observations which are subject to a high risk of disclosure
should be censored or eliminated completely from the data set. This obviously
creates a sample selection problem, since observations on the dependent variable
are nonrandomly dropped from the original sample. However, if it is possible to
capture the selection mechanism correctly by a control function, the parameters
of the true data generating process can be estimated consistently. As before we
consider the estimation of the model given by (1).

Without loss of generality we assume that the disclosure risk is beyond some
unacceptable level if an item takes on an extreme value defined by an upper and
a lower bound (e.g. a lower and an upper quantile). Let $W_i$ be the vector of $L$
covariates of observation $i$. This vector contains the explanatory variables $X_i$,
the dependent variable $Y_i$, as well as other variables which can be used for the
disclosure of the individual respondent. An observation will not be blanked if
all covariates of $W_i$ are within the range of the quantiles $\theta_l$ and $\theta_u$. Hence, the binary
indicator for the selection into the publicly available data set is defined as:

$$S_i = \begin{cases} 1, & \text{if } q_{\theta_l}(W_{1j}, \dots, W_{nj}) < W_{ij} < q_{\theta_u}(W_{1j}, \dots, W_{nj}) \ \ \forall j = 1, \dots, L \\ 0, & \text{else,} \end{cases} \qquad (3)$$

where $q_\theta(\cdot)$ defines the $\theta$-quantile of variable $W_j$ with $\theta_l < \theta_u$. Usually the risk
of disclosure is particularly high only for large values of $W_j$, such that blanking
of values below the lower quantile can be neglected. Alternatively, one may think of a selection
rule that is based on a combination of covariates, which may lead to a high risk
of disclosure.

In the following we assume that the selection rule can be approximated by
the semiparametric single-index form:

$$S_i = 1\{\varphi(Z_i'\gamma) > \tau\} \qquad (4)$$

where $\varphi(\cdot)$ denotes a twice differentiable unknown function of the index function
$I_i = Z_i'\gamma$ and $\tau$ an unknown threshold parameter. Equation (4) represents a
generalization of the standard linear selection rule $S_i = 1\{Z_i'\gamma + u_i > 0\}$. The
explanatory variables of the selection equation can be either explanatory variables of the
structural equation (1) or other covariates which affect $S_i$ via cross-correlations.
Since the explanatory variables of the selection equation may also be subject
to blanking, we suggest the following transformation if blanking is only due to
large values of covariates:

$$Z_{ij}^* = 1\{Z_{ij} < q_{\theta_u}(Z_j)\}\, Z_{ij} + \left(1 - 1\{Z_{ij} < q_{\theta_u}(Z_j)\}\right) E[Z_{ij} | Z_{ij} \ge q_{\theta_u}(Z_j)] \qquad (5)$$

With this transformation the observations not critical to disclosure remain in
the data set in their original form, while the other observations are replaced by
their conditional mean, which can be estimated by the corresponding conditional
arithmetic mean. Finally, the selection equation can also contain variables that
are not based on a selection rule such as (3). For instance, it is common to
blank information on firms which belong to sectors with only a few members.
Hence the number of firms in a sector can be used as a covariate in the selection
equation.
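The blanking rule and the masking of the selection regressors can be illustrated with a short Python sketch for the case in which only large values are critical. The function names `selection_indicator` and `mask_selection_regressors`, the 90 percent cut-off and the simulated data are assumptions made for illustration.

```python
import numpy as np

def selection_indicator(W, upper=0.9):
    """S_i = 1 if every column of row i lies below its sample upper quantile."""
    q_upper = np.quantile(W, upper, axis=0)
    return np.all(W < q_upper, axis=1).astype(int)

def mask_selection_regressors(Z, upper=0.9):
    """Replace values at or above the upper quantile by their conditional mean."""
    Z = np.asarray(Z, dtype=float)
    q_upper = np.quantile(Z, upper, axis=0)
    Z_star = Z.copy()
    for j in range(Z.shape[1]):
        tail = Z[:, j] >= q_upper[j]
        Z_star[tail, j] = Z[tail, j].mean()   # conditional arithmetic mean
    return Z_star

# Hypothetical data: three independent standard normal covariates.
rng = np.random.default_rng(0)
Z = rng.standard_normal((1000, 3))
S = selection_indicator(Z)
Z_star = mask_selection_regressors(Z)
```

Rows with `S == 1` are left unchanged by the transformation, and replacing tail values by their arithmetic mean preserves each column mean, so the masked regressors remain usable in the selection equation for all observations.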

Let $n < N$ be the number of observations not subject to blanking. For these
observations the following conditional population regression function holds:

$$E[Y_i | Z_i'\gamma, S_i = 1] = m(X_i, \beta) + E[\varepsilon_i | Z_i'\gamma, S_i = 1]$$
$$Y_i = m(X_i, \beta) + \lambda(Z_i'\gamma) + \zeta_i \qquad (6)$$

where $\lambda(\cdot)$ is a general selection control function and $\zeta_i = \varepsilon_i - \lambda(Z_i'\gamma)$ is a
heteroscedastic error term with $E[\zeta_i | Z_i'\gamma, S_i = 1] = 0$. The identification conditions for
this semiparametric model with a linear regression function are discussed in
detail by [New99]. The problem of finding adequate exclusion restrictions for the
case of data blanking is significantly less severe than for typical applications of
sample selectivity correction models based on the selection on unobservables. In
our case the overidentifying restrictions naturally arise from covariates that are
not part of the structural equation but affect the probability of blanking.^2

The parameters of the structural equation augmented by the semiparametric
selection control function will be estimated by a two-step procedure similar to
Heckman’s two-step estimator. In the first stage the parameters of the
selection equation can be estimated on the basis of a semiparametric binary choice
model. Here we use the semiparametrically efficient, asymptotically normally
distributed and $\sqrt{N}$-consistent estimator proposed by [KS93]. Obviously, alternative
$\sqrt{N}$-consistent estimators are feasible, such as the semi-nonparametric likelihood
approach by [GLL93] or the semiparametric moment estimator by [Ich93]. For
the second stage of estimation we use Newey’s [New99] semiparametric
estimator, which uses a general series approximation for the selection control function.
This estimator, originally proposed for the linear model, can easily be extended
to models with nonlinear mean functions. Newey’s second stage estimator is
particularly suitable in this context since the series approximation is able to capture
nonlinearities in the control function which arise from the heteroscedasticity of
the error term.

The Klein-Spady estimator can be regarded as a parametric likelihood
approach leaving the choice probability unspecified:

$$\ln L(\gamma) = \sum_{i=1}^{N} S_i \ln P[S_i = 1 | Z_i'\gamma] + (1 - S_i) \ln[1 - P[S_i = 1 | Z_i'\gamma]] \qquad (7)$$

Using Bayes’ theorem the choice probability can be reformulated as

$$P(S_i = 1 | Z_i'\gamma) = \frac{P(S_i = 1)\, g_{I|S=1}(Z_i'\gamma | S_i = 1)}{g_I(Z_i'\gamma)} \qquad (8)$$

where $g_I$ denotes the density of the index $I_i = Z_i'\gamma$ and $g_{I|S=1}$ is the density
conditional on $S_i = 1$. An estimate of the choice probability (8) can be obtained
by estimating all terms on the right-hand-side nonparametrically. We obtain
estimates for the two densities by kernel density estimation, while $P(S_i = 1)$ is
replaced by its arithmetic mean. Replacing the choice probability by its estimate
yields the quasi-likelihood function:

$$\max_{\gamma} \ln Q(\gamma) = \sum_{i=1}^{N} S_i \ln\left((\hat{P}[S_i = 1 | Z_i'\gamma])^2\right) + (1 - S_i) \ln\left((1 - \hat{P}[S_i = 1 | Z_i'\gamma])^2\right), \qquad (9)$$

where squaring the outcome probabilities is necessary to take into account
estimates of the probabilities beyond the 0-1 interval.

^2 Note that due to the semiparametric form, the intercept of the structural equation
cannot be identified without additional assumptions, see [AS98].
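The kernel-based choice probability and the quasi-likelihood can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not taken from the paper: a Gaussian kernel, a rule-of-thumb bandwidth, no leave-one-out correction, and the hypothetical function names `choice_probability` and `quasi_loglik`. With a Gaussian kernel the estimated probabilities stay inside the unit interval, so the squaring in (9) is kept only for fidelity to the formula; a small constant `eps` guards the logarithm.

```python
import numpy as np

def choice_probability(index, S, bandwidth):
    """Kernel estimate of P[S = 1 | index], the sample analogue of (8)."""
    u = (index[:, None] - index[None, :]) / bandwidth
    K = np.exp(-0.5 * u ** 2)                 # Gaussian kernel weights
    # Numerator ~ P(S = 1) * g_{I|S=1}; denominator ~ g_I (common factors cancel).
    return (K @ S) / K.sum(axis=1)

def quasi_loglik(gamma, Z, S, eps=1e-10):
    """Quasi-log-likelihood (9) with squared estimated probabilities."""
    index = Z @ gamma
    h = 1.06 * index.std() * len(S) ** (-0.2)  # rule-of-thumb bandwidth (assumption)
    p = choice_probability(index, S, h)
    return np.sum(S * np.log(p ** 2 + eps) + (1 - S) * np.log((1 - p) ** 2 + eps))

# Hypothetical selection process driven by the index Z1 + Z2.
rng = np.random.default_rng(0)
n = 500
Z = rng.standard_normal((n, 2))
S = (Z @ np.array([1.0, 1.0]) + rng.standard_normal(n) > 0).astype(float)
ql_true = quasi_loglik(np.array([1.0, 1.0]), Z, S)
ql_wrong = quasi_loglik(np.array([1.0, -1.0]), Z, S)
p_hat = choice_probability(Z @ np.array([1.0, 1.0]), S, 0.4)
```

Because the index is identified only up to scale, a full estimator would fix one coefficient (or the norm of the coefficient vector) and maximize `quasi_loglik` over the remaining elements with a numerical optimizer.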

For the second stage of estimation Newey suggests replacing the unknown
control function $\lambda(\cdot)$ by a linear combination of $J$ basis functions $p_j$:

$$\lambda(Z_i'\gamma) \approx \sum_{j=1}^{J} \eta_j\, p_j(Z_i'\gamma) \qquad (10)$$

where for $J \to \infty$ the approximation error vanishes. The $\eta_j$'s are unknown
coefficients to be estimated. The basis functions depend on the index function
only. If we replace $\lambda$ by its approximation (10) we obtain

$$Y_i = m(X_i, \beta) + \sum_{j=1}^{J} \eta_j\, \hat{p}_j + \xi_i, \qquad \xi_i = \sum_{j=1}^{J} \eta_j (p_j - \hat{p}_j) + \zeta_i, \qquad (11)$$

where $\hat{p}_j = p_j(Z_i'\hat{\gamma})$ denotes the basis function evaluated at the first-stage estimate.
The coefficients $\beta$ and $\eta_j$ can be estimated by nonlinear least squares. The
optimal order $J$ can be chosen by the minimization procedure described in
Appendix 4. For the basis functions Newey suggests the following form:

$$p_j(Z_i'\gamma) = [\tau(Z_i'\gamma)]^j \qquad (12)$$

where $\tau$ is a monotone function restricted to the interval $[-1, 1]$, for example^3

$$\tau(Z_i'\gamma) = 2\Phi(Z_i'\gamma) - 1.$$
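The series approximation of the control function can be sketched in Python for the special case of a linear mean function, where the second stage reduces to ordinary least squares on the regressors augmented by the basis terms. The names `series_basis` and `second_stage_linear`, the choice of the normal distribution function for the monotone transformation and the simulated data are assumptions for illustration; in practice the index would be the first-stage estimate rather than a simulated variable, and no constant is included because the intercept is not identified separately from the control function.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(v):
    """Standard normal CDF, vectorised over a NumPy array."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(v, dtype=float) / sqrt(2.0)))

def series_basis(index, J):
    """Columns tau(index)**j for j = 1..J with tau(v) = 2 * Phi(v) - 1."""
    tau = 2.0 * normal_cdf(index) - 1.0
    return np.column_stack([tau ** j for j in range(1, J + 1)])

def second_stage_linear(y, X, index, J):
    """Least squares of y on [X, basis]; returns (beta, eta)."""
    design = np.column_stack([X, series_basis(index, J)])
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    k = X.shape[1]
    return coef[:k], coef[k:]

# Hypothetical data with a control function that is linear in tau.
rng = np.random.default_rng(0)
n = 2000
X = rng.standard_normal((n, 1))
index = rng.standard_normal(n)
tau = 2.0 * normal_cdf(index) - 1.0
y = 1.0 * X[:, 0] + 0.5 * tau + 0.1 * rng.standard_normal(n)
beta, eta = second_stage_linear(y, X, index, J=3)
basis = series_basis(index, 3)
```

For a nonlinear mean function the same basis columns would enter a nonlinear least squares problem instead of the linear regression used here.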

Monte Carlo Evidence. For our Monte Carlo study we blank the original data
using (3), while the regressors of the selection equation are masked according
to (5).

In a first step we look at the small sample performance given a linear model
of the form^4:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$

The error term is independently t-distributed with 4 degrees of freedom, thus
$V[\varepsilon] = 2$. The two explanatory variables are drawn from a bivariate normal
distribution. The true parameter values are $\beta_0 = 0.5$, $\beta_1 = 1$ and $\beta_2 = -1$. Based
on a Monte-Carlo design for stochastic regressors and sample sizes of $N = 100$,
1000 and 3500 we use $R = 1000$ replications for the evaluation. As a benchmark,
we also compute the least squares estimator $\hat{\beta}$ for the original sample in order
to see how strongly blanking affects the empirical results.

The explanatory variables as well as the additional instruments for the
selection equation are drawn from a multivariate normal distribution,
where $X_1 = Z_1$ and $X_2 = Z_2$. An observation $i$ is deleted from the data set
($S_i = 0$) if at least one of the covariates of $W_i = (Y_i, Z_{1i}, \dots, Z_{4i})$ is larger than
the 90 percent quantile in the sample:

$$S_i = \begin{cases} 1, & \text{if } 1\{Y_i < q_{.90}(Y)\} \cdot \prod_{j=1}^{4} 1\{Z_{ji} < q_{.90}(Z_j)\} = 1 \\ 0, & \text{else.} \end{cases}$$

^3 More details on the Newey procedure can be found in Appendix 4.

^4 In a later version of this paper we will also present the results for the nonlinear mean
function used in the previous section.


Table 2 contains the results of the Monte-Carlo simulations for the sample
sizes $N = 100$, 1000 and 3500. Due to blanking, the estimation results of the
second stage rely on smaller sample sizes. In terms of the sample selection bias
only the observations that are deleted because the dependent variable is larger
than the 90 percent quantile are relevant. Deletion of observations due to the
blanking of one or more explanatory variables only leads to a loss of efficiency.
In Table 2, $\bar{n}$ denotes the average sample size used in the second stage, while $\bar{n}_Z$
refers to the average number of observations that could be used if blanking were
only due to the disclosure risk of the explanatory variables.

For the small sample size of $N = 100$ ($\bar{n} = 60.14$) the bias is somewhat
larger than the bias of the least squares estimates. However, the bias decreases
significantly with larger sample sizes. Although somewhat larger, the root mean
squared error of the semiparametric two-step estimator is not far from that
of the corresponding least squares estimates based on the original data. In
small samples the standard errors of the estimates are not estimated with great
precision. But with a medium sample size ($N = 1000$, $\bar{n} = 628.01$) the relative
estimated standard error comes close to its desired value 1. Note that the
comparatively good performance of the semiparametric estimator in terms of bias
avoidance is a consequence of our specific Monte-Carlo design. The selection
problem in our Monte-Carlo study arises solely from the blanking of the
observations on the dependent variable above the 90 percent quantile. If there were a
correlation between the $Z$-variables and the error term $\varepsilon$, the selectivity problem
would be more severe.



Table 2. MC-Results for the Estimation of the Semiparametric Selectivity Model.
Estimates refer to the coefficient on $X_1$, number of replications = 1000.

$\beta_1 = 1$                 MEAN     BIAS     RMSE    RELSE

N = 100 ($\bar{n} = 60.14$, $\bar{n}_Z = 66.54$)

4 Conclusions

In this paper we analyze the problem of estimating nonlinear regression models 
when data are used that have been protected against disclosure. We can show 
that if data are protected by noise addition, the SIMEX method is a powerful 
tool to remove the bias that results from measurement errors. The performance 
of the computationally less burdensome calibration method is somewhat weaker. 
This method seems to be a reasonable estimation procedure if the perturbation 
by noise addition is comparatively small and samples are large. 

The inconsistency of parameter estimates by blanking, as an alternative 
method of disclosure limitation, can be removed satisfactorily by the two-step 
sample selection correction method introduced in this study. However, blanking 
may lead to a large loss of observations, particularly if many variables in the 
dataset can be used to disclose individual information. Our study has shown
that the bias generated by disclosure limitation techniques can be removed if
additional information on the set-up of the method (truncation rule, variance
of the measurement error etc.) is provided. In this case, effective disclosure
limitation and consistent parameter estimation need not be contradictory.

A final conclusion on the relative ability of the different econometric
approaches to estimate the parameters of the true data generating process when data
have been protected against disclosure by blanking or noise addition would be
premature. More Monte-Carlo evidence on various popular nonlinear
regression models such as probit and Tobit is needed. In the case of blanking,
alternative specifications of the selection mechanism, which differ from the linear
selection equation, should be considered.

Appendix

VC-Matrix of the Newey-Series Estimator

Let $W_i = (X_i', p_{1i}, p_{2i}, \dots, p_{Ji})'$ and
$W = (W_1, W_2, \dots, W_n)'$.

The optimal number of basis functions minimizes the following function



JoPT = argmin CV (J) = ^ 



(l - 2 ,5,) 



1 - &
where 5i = ^ W' and ei = Yi~ W- 6.



$$V_{NS} = [I_k, 0]\, (W'W)^{-1} A\, (W'W)^{-1}\, [I_k, 0]'$$

$$A = \sum_{i} W_i W_i' (Y_i - \hat{m}_i)^2 + H\, \hat{V}(\hat{\gamma})\, H', \qquad H = \sum_{i} W_i \frac{\partial \hat{\lambda}(Z_i'\hat{\gamma})}{\partial \gamma'}$$

where $I_k$ is the identity matrix with a dimension equal to the number of
explanatory variables in the structural equation. $\hat{V}(\hat{\gamma})$ is a consistent estimate of
the variance-covariance matrix of the first-stage estimator.

Abstract. Masking methods protect data sets against disclosure by
perturbing the original values before publication. Masking causes some
information loss (masked data are not exactly the same as original data)
and does not completely suppress the risk of disclosure for the individuals
behind the data set. Information loss can be measured by observing
the differences between original and masked data while disclosure risk
can be measured by means of record linkage and confidentiality intervals.
Outliers in the original data set are particularly difficult to protect, as
they correspond to extreme individuals who stand out from the rest. The
objective of our work is to compare, for different masking methods, the
information loss and disclosure risk related to outliers. In this way, the
protection level offered by different masking methods to extreme
individuals can be evaluated.

Keywords: Statistical database protection, Statistical disclosure
control, Outliers, Masking methods.



1 Introduction 

Publication of statistical data sets can lead to disclosure of confidential
information related to the individual respondents from whom the data have been
collected. Several measures of disclosure risk have been proposed in
the literature. They are divided into two groups: those
designed for categorical data and those designed for continuous data.

Our work focuses on continuous data masked using perturbative methods
before publication. Specifically, we concentrate on how statistical disclosure
control (SDC) masking methods protect outliers. It must be noted that extreme
individuals are particularly easy to identify; thus their disclosure risk is higher.
Our work analyzes, for each masking method, the information loss and disclosure
risk of extreme records with respect to the average results for the overall data
set. Our objective is not to make an exhaustive study of masking methods, but
to gain some understanding of how masking affects outliers.

* This work was partly supported by the European Commission under project “CASC”
(IST-2000-25069) and by the Spanish Ministry of Science and Technology and the
FEDER fund under project “STREAMOBILE” (TIC-2001-0633-C03-01).

Disclosure risk measures are based on distance-based record linkage and the 
construction of confidentiality intervals. 

The masking methods compared are: resampling, JPEG lossy compression, 
multivariate microaggregation, additive noise and rankswapping. Our measures 
have been obtained from the experimental study of the masked versions of two 
data sets. 

Section 2 lists the masking methods which have been studied in this work. 
Section 3 specifies the measures used to compute information loss and disclosure 
risk. The two data sets used to obtain measures are described in Section 4. 
Section 5 reports on the results obtained for the different masking methods 
applied to each of the two data sets, i.e. the information loss and disclosure risk 
of outliers with respect to the overall data set. In this way, we determine which 
masking methods offer the same protection to outliers and average individuals 
and which masking methods offer a lower protection to outliers. Conclusions are 
summarized in Section 6. 



2 Masking Methods for Continuous Microdata 

Several masking methods for protection of statistical data have been presented 
in the literature. See [10,5,11,1] for a survey of such methods. The masking 
methods considered in this paper are: 

— JPEG: This method is based on the JPEG [7] lossy image compression 
algorithm. The idea is to take original data values (properly scaled) as pix- 
els of a digital image and apply the JPEG algorithm. The JPEG algorithm 
takes a parameter p (between 1, maximum compression, and 100, minimum 
compression) which determines the compression level. The masked data set 
corresponds to the lossy reconstructed data after JPEG compression (properly
unscaled). Our experiments have taken p from 10 to 100 in steps of 5.

— Rank swapping: This method perturbs a different variable at each step. It
ranks the values of the variable and randomly swaps values that are within a 
certain maximum rank difference [9] . The maximum rank difference between 
two swappable values is specified by parameter p, which is expressed as a 
percent of the total number of records. Our experiments have taken p from 
1 to 20 in steps of 1. 

— Additive noise: This method perturbs a different variable at each step. 
Each variable is perturbed by the addition of a random Gaussian value with
0 mean and standard deviation p · s, where s is the standard deviation of the
variable and p is a parameter [8]. Our experiments have taken p from 0.02
to 0.2 in steps of 0.02.

— Resampling: For each variable of the masked data set, values are taken
through random sampling with replacement from the values in the original 
data set. This is done n times for each variable and then, after sorting the 
extracted samples, the average of the n sampled values at each position 
is computed. At the end, we take each variable in turn and re-order the 
obtained averages by the order of original data. We have taken n from 1 to 
3 in our experiments. 

— Microaggregation: Taking all variables at a time, groups of k records are 
built so that each group is formed by records with the maximum similarity 
between them [3] . The average of each group is published in the masked data 
set. Our experiments have been performed by taking values for k from 3 to 
20 in steps of 1. 
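Three of the methods above are simple enough to sketch in a few lines. The following Python/NumPy code is only an illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the sample standard deviation is used for s, and the rank swapping routine pairs each value with a randomly chosen, not yet swapped value at most p percent of n ranks above it. JPEG and multivariate microaggregation are omitted because they need more machinery.

```python
import numpy as np


def additive_noise(X, p, rng):
    """Add zero-mean Gaussian noise with standard deviation p * s_j to column j."""
    s = X.std(axis=0, ddof=1)          # assumption: sample standard deviation
    return X + rng.normal(0.0, 1.0, size=X.shape) * (p * s)


def rank_swap(X, p, rng):
    """Swap values whose ranks differ by at most p percent of the record count."""
    n, d = X.shape
    window = max(1, int(round(n * p / 100.0)))
    Y = X.astype(float).copy()
    for j in range(d):                 # one variable at a time
        order = np.argsort(X[:, j], kind="stable")   # record indices sorted by value
        swapped = np.zeros(n, dtype=bool)
        for r in range(n):
            if swapped[r]:
                continue
            hi = min(n - 1, r + window)
            candidates = [t for t in range(r + 1, hi + 1) if not swapped[t]]
            if not candidates:
                continue               # no partner left: the value stays unchanged
            t = rng.choice(candidates)
            a, b = order[r], order[t]
            Y[a, j], Y[b, j] = X[b, j], X[a, j]
            swapped[r] = swapped[t] = True
    return Y


def resample(X, n_samples, rng):
    """Average n_samples sorted bootstrap samples, then restore the original order."""
    n, d = X.shape
    Y = np.empty((n, d), dtype=float)
    for j in range(d):
        draws = rng.choice(X[:, j], size=(n_samples, n), replace=True)
        draws.sort(axis=1)                         # sort each extracted sample
        averages = draws.mean(axis=0)              # average at each sorted position
        ranks = np.argsort(np.argsort(X[:, j], kind="stable"), kind="stable")
        Y[:, j] = averages[ranks]                  # record with rank k gets k-th average
    return Y
```

A call such as `rank_swap(X, 5, np.random.default_rng(0))` would correspond to the parameter choice p = 5 in the notation above. Rank swapping leaves the set of values of each variable unchanged, whereas additive noise and resampling produce new values.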

3 Measures for Information Loss and Disclosure Risk 

Any work aimed at the comparison of different masking methods requires quan- 
titative measures to obtain results. These measures are divided into two groups: 
those measuring information loss and those measuring disclosure risk. 

3.1 Information Loss Measures 

In [4] several information loss measures are proposed. Five of these measures 
were used in that paper to construct a benchmark. Some of those measures 
compare original and masked data directly and some compare statistics on both 
data sets. Measures targeted at comparison of statistics are not useful in our 
work as they refer to the overall data set and do not provide information on 
individual records, and particularly on outliers. 

Let X = {x_ij} be the original data set and X' = {x'_ij} a masked version of
X. Both data sets consist of n records of d variables each. Our work uses two 
measures: 

— IL1: This measure from [4] computes the mean variation between the original
and the perturbed version of a record i:

\[ \mathrm{IL1}(i) = \frac{1}{d} \sum_{j=1}^{d} \frac{|x_{ij} - x'_{ij}|}{|x_{ij}|} \]

We must take into account that, if the j-th variable in the i-th original record
has a value x_ij = 0, the aforementioned measure will result in a “division
by 0” error. In this case, we replace x_ij by x'_ij in the formula. If both x_ij
and x'_ij are 0, the j-th variable is excluded from the computation for the
i-th record (since there are no changes in that variable). In IL1, the effect
of absolute variation depends on the distance between x_ij and 0. IL1 is
greater for variables near 0. The measure IL1s proposed in [11] overcomes
this drawback.

— IL1s: Given the original and the perturbed version of a record i,

\[ \mathrm{IL1s}(i) = \frac{1}{d} \sum_{j=1}^{d} \frac{|x_{ij} - x'_{ij}|}{\sqrt{2}\, S_j} \]

where S_j is the standard deviation of the j-th variable in the original data
set.
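As an illustration, the two per-record measures could be computed as follows. This is a sketch under assumptions that the text does not settle: the zero-handling rule is read as replacing only the denominator by the masked value, a record's IL1 is averaged over the variables that remain after exclusion, and S_j is taken as the sample standard deviation. The function names are hypothetical.

```python
import numpy as np


def il1(X, Xm):
    """Per-record IL1: mean relative change between original X and masked Xm."""
    X = np.asarray(X, dtype=float)
    Xm = np.asarray(Xm, dtype=float)
    diff = np.abs(X - Xm)
    # Assumption: when x_ij is 0, the masked value takes its place in the denominator.
    denom = np.where(X != 0, np.abs(X), np.abs(Xm))
    valid = denom != 0                 # both values 0: variable is excluded
    ratio = np.zeros_like(diff)
    np.divide(diff, denom, out=ratio, where=valid)
    counts = valid.sum(axis=1)
    # Assumption: average over the variables that were not excluded.
    return ratio.sum(axis=1) / np.maximum(counts, 1)


def il1s(X, Xm):
    """Per-record IL1s: absolute change scaled by sqrt(2) times the original std."""
    X = np.asarray(X, dtype=float)
    Xm = np.asarray(Xm, dtype=float)
    S = X.std(axis=0, ddof=1)          # assumption: sample standard deviation
    return (np.abs(X - Xm) / (np.sqrt(2.0) * S)).mean(axis=1)
```

Averaging the returned vector over all records gives the value for the overall data set, and averaging it over the outlier records only gives the value for outliers.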



3.2 Disclosure Risk Measures 

The disclosure risk measures used in our work were proposed in [4]. These mea- 
sures evaluate: i) the risk that an intruder having additional information can 
link a masked record with the corresponding record in the original data set; ii) 
the risk that the values of an original record can be accurately estimated from 
the published masked records. The first kind of risk is evaluated through record 
linkage and the second kind through the construction of confidentiality intervals. 

— Record linkage: This measure is based on the assumption that an intruder 
has additional information (disclosure scenarios) so that she can link the 
masked record of an individual to its original version. There exist several 
techniques for record linkage, such as probabilistic [6] and distance-based. 
Our work uses the distance-based technique. In this technique, linkage pro- 
ceeds by computing the distances between records in the original and masked 
data sets. The distances used are standardized to avoid scaling problems. For 
each record in the masked data set, the distance to every record in the orig- 
inal data set is computed. A record in the masked data set is labeled as 
correctly linked when the nearest record in the original data set turns out 
to be the corresponding original record. The percentage DLD of correctly 
linked records is a measure of disclosure risk. 

— Interval disclosure: Given the value of a masked variable, we check whether 
the corresponding original value falls within an interval centered on the 
masked value. The width of the interval is based on the rank of the vari- 
able or on its standard deviation. For data without outliers, using the rank 
or the standard deviation yields similar results; in the presence of outliers, 
both ways of determining the interval width are different and complemen- 
tary. 

• Rank-based intervals: For a record in the masked data set, compute 
rank intervals as follows. Each variable is independently ranked and a 
rank interval is defined around the value the variable takes for each 
record. The ranks of values within the interval for a variable around 
record r should differ by less than p percent of the total number of records
and the rank in the center of the interval should correspond to the value 
of the variable in record r. Then, the proportion of original values that 
fall into the interval centered around their corresponding masked value is 
a measure RID of disclosure risk. A 100 percent proportion means that 
an attacker is completely sure that the original value lies in the interval 
around the masked value (interval disclosure).

• Standard deviation-based intervals: Intervals are built around values
of the masked variables for each record, but the interval width is not
computed in terms of a rank percentage but in terms of a percentage p 
of the standard deviation of the variable. A measure of risk SDID can 
be obtained in a way analogous to the way RID is obtained for rank 
intervals. 

In both measures, we have considered 10 different interval widths (from p = 
1% up to p = 10%). The final result is the average of the results obtained for 
the 10 interval widths. 
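The three risk measures can be sketched in Python as below. The details are assumptions made for illustration only: records are matched by position (row i of the masked file comes from row i of the original file), both files are standardized with the means and standard deviations of the original file, the rank interval is taken over the sorted masked values, and the standard deviation interval has half-width p percent of the original standard deviation. The function names are hypothetical.

```python
import numpy as np


def dld(X, Xm):
    """Percentage of masked records whose nearest original record is their source."""
    mu, s = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z, Zm = (X - mu) / s, (Xm - mu) / s        # standardize to avoid scaling problems
    hits = 0
    for i in range(len(Zm)):
        d2 = ((Z - Zm[i]) ** 2).sum(axis=1)    # squared distance to every original record
        hits += int(np.argmin(d2) == i)
    return 100.0 * hits / len(Zm)


def rid(X, Xm, widths=range(1, 11)):
    """Rank-based interval disclosure, averaged over the given widths (in percent)."""
    n, d = X.shape
    results = []
    for p in widths:
        w = int(n * p / 100.0)                 # maximum rank difference
        inside = 0
        for j in range(d):
            sorted_m = np.sort(Xm[:, j])
            rank = np.searchsorted(sorted_m, Xm[:, j])
            lo = sorted_m[np.maximum(rank - w, 0)]
            hi = sorted_m[np.minimum(rank + w, n - 1)]
            inside += np.sum((X[:, j] >= lo) & (X[:, j] <= hi))
        results.append(100.0 * inside / (n * d))
    return float(np.mean(results))


def sdid(X, Xm, widths=range(1, 11)):
    """Standard deviation-based interval disclosure, averaged over the widths."""
    s = X.std(axis=0, ddof=1)
    gap = np.abs(X - Xm)
    return float(np.mean([100.0 * np.mean(gap <= (p / 100.0) * s) for p in widths]))
```

Restricting the rows of `Xm` (and the matching positions) to the outlier records would give the outlier version of each measure, while the comparison in `dld` would still run against all original records.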

4 Data Sets 

During the Research Meeting of the CASC project¹ held in April 2002 in
Plymouth, the need was identified for reference data sets to test and compare
microdata SDC methods. As a consequence, three data sets were proposed as reference
data for numerical microdata protection. In our work, we have selected two out 
of those three data sets. The two selected data sets are: 

— “CENSUS” Data Set: This test data set was obtained using the Data 
Extraction System of the U.S. Bureau of the Census². Specifically, from
the available data sources, the “March Questionnaire Supplement - Person
Data Files” from the Current Population Survey of year 1995 was used. 13
quantitative variables were chosen from this data source: AFNLWGT, AGI,
EMCONTRB, ERNVAL, FEDTAX, FICA, INTVAL, PEARNVAL, POTHVAL,
PTOTVAL, STATETAX, TAXINC, WSALVAL. From the obtained
records, a final subset of 1080 records was selected so that there were no 
repeated values in 7 of the 13 variables. This was done to approximate the 
behavior of continuous variables (where repetitions are unlikely). A more 
detailed description of the procedure followed to compile this data set can 
be found in [2]. 

— “EIA” Data Set: This data set was obtained from the U.S. Energy Information
Administration and contains 4092 records³. Initially, the data file contained
15 variables, from which the first 5 were removed as they corresponded
to identifiers. We have worked with the variables: RESREVENUE,
RESSALES, COMREVENUE, COMSALES, INDREVENUE, INDSALES,
OTHREVENUE, OTHRSALES, TOTREVENUE, TOTSALES.

5 Results 

The objective of our empirical work is to compare the measures for outliers with 
those obtained for the overall data set. We also study the general performance of
each measure for each masking method. First of all, we describe the procedure
that has been applied to classify a record as an outlier.

¹ http://neon.vb.cbs.nl/casc

² http://www.census.gov/DES/www/welcome.html

³ http://www.eia.doe.gov/cneaf/electricity/page/eia826.html

The procedure is as follows: 

1. To start with, all variables of the original data set are independently stan- 
dardized. This is done by subtracting from each value the average of the 
variable and then dividing by the standard deviation of that variable. 

2. Next, the average record is considered. Since data have been standardized, 
the co-ordinates of the average record correspond to the all-zeroes vector. 

3. The distance from each record to the average record is now computed. 

4. Finally, all records are sorted by their distance to the average record. The 
5% farthest records are classified as outliers. 
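The four steps above can be sketched as follows. The text does not name the distance, so the Euclidean distance is assumed here, and the function name is hypothetical.

```python
import numpy as np


def flag_outliers(X, fraction=0.05):
    """Return a boolean mask marking the records farthest from the average record."""
    X = np.asarray(X, dtype=float)
    # Step 1: standardize each variable independently.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Steps 2 and 3: the average record is the all-zeroes vector, so the distance
    # to it is the norm of the standardized record (Euclidean, by assumption).
    dist = np.sqrt((Z ** 2).sum(axis=1))
    # Step 4: sort by distance and keep the farthest fraction of records.
    n_out = int(round(fraction * len(X)))
    farthest = np.argsort(dist)[::-1][:n_out]
    mask = np.zeros(len(X), dtype=bool)
    mask[farthest] = True
    return mask
```

With the 1080 records of the “CENSUS” data set, the default fraction would flag 54 records.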

Next, we report on the results obtained for each masking method using the 
measures described in Section 3. Measures regarding outliers were computed as 
follows: 

— For information loss, IL1 and IL1s are computed by taking, for each outlier
record in the original data set, its corresponding masked record from the
masked data set.

— For disclosure risk, each masked outlier record is compared against all records 
in the original data set. 

These results are presented in several graphics. More specifically, a graphic 
for each measure and data set is presented. In these graphics, the black line 
indicates the measure obtained for the overall data set and the gray line indicates 
the results obtained for the outliers. 

Each graphic is divided into five sections, each corresponding to a family of
masking methods: JPEG (J), rankswapping (R), additive noise (N), resampling 
(RS) and microaggregation (M). Labels along the abscissae indicate the methods 
that have been tried: the label for a method consists of a letter indicating the 
method family and a number indicating a particular parameter choice. 



5.1 Information Loss: IL1

For the “CENSUS” data set (Figure 1), we observe: 

— With JPEG, additive noise and microaggregation, information loss is lower 
for outliers. This means they receive a lower perturbation. 

— With rank swapping, information loss is higher for outliers. As parameter 
choices change, IL1 stays low without significant oscillations.

— For low values of its parameter, JPEG causes the highest information loss. 
This parameter has a direct influence on the information loss measure. 

— The resampling method is the one presenting the lowest information loss. 

— In additive noise, the value of the parameter has a great influence on
information loss, but less than in the case of JPEG.

Fig. 1. IL1 results for the “CENSUS” data set

Fig. 2. IL1 results for the “EIA” data set



For the “EIA” data set (Figure 2), we observe: 

— Results for the JPEG and additive noise methods are similar to those obtained
in the “CENSUS” data set. The differences between IL1 for outliers
and for the overall data set are very sharp. The overall information loss is
much higher than in the “CENSUS” data set.

— In rankswapping, the differences in information loss for outliers and for the
overall data set depend on the parameter choice. The overall loss stays close 
to the one obtained for the “CENSUS” data set. 

— If microaggregation is used, the overall loss is much lower than in the “CEN- 
SUS” data set. Loss for outliers is similar to the loss for the overall data set 
until the parameter is close to 15; beyond this value, loss for outliers is a bit 
higher than for the overall data set. 

5.2 Information Loss: IL1s

For the “CENSUS” data set (Figure 3), we observe: 

— With JPEG, rankswapping, resampling and microaggregation, the informa- 
tion loss for outliers is higher than when measured over the overall data 
set. 

— With additive noise, IL1s is the same for outliers and for the overall data
set.

— Generally, additive noise and resampling are those with the lowest IL1s.
JPEG and rankswapping present a wide range of losses depending on the 
chosen parameters while microaggregation stays more stable when modifying 
its parameter. 

For the “EIA” data set (Figure 4), we observe: 

— In all methods, the behavior of outliers is similar to that observed in the
“CENSUS” data set. It is worth mentioning that, with rankswapping, the
differences between outliers and the overall data set are magnified.

— In general, IL1s presents the same behavior as in the “CENSUS” data set.
The most evident difference is that, now, resampling and microaggregation
are those with the lowest loss.



5.3 Disclosure Risk: Record Linkage DLD 

For the “CENSUS” data set (Figure 5), we observe: 

— With JPEG, additive noise and resampling, DLD is higher for outliers than 
for the overall data set; with rankswapping and microaggregation, the con- 
trary happens. 

— Globally, resampling presents the highest DLD risk while microaggregation 
presents the lowest DLD. 

— As the parameter choice changes, we can observe that, for JPEG and rank- 
swapping, DLD oscillates in a very wide range; for additive noise, the DLD 
range is more moderate; for microaggregation, the DLD range is narrowest. 

Fig. 3. IL1s results for the “CENSUS” data set

Fig. 4. IL1s results for the “EIA” data set

For the “EIA” data set (Figure 6), we observe:



— With JPEG, additive noise, resampling and microaggregation, the DLD disclosure
risk is higher for outliers than for the overall data set. The difference
between the two groups of individuals is very large in JPEG and additive
noise, while in microaggregation the difference is very small.

— Rankswapping presents a lower DLD for outliers than for the overall data
set.

— Even though the DLD value for most methods is heavily dependent on the
parameter choice, microaggregation stands out as a method where DLD is
fairly stable. In rankswapping, DLD is very low for parameter values higher
than 12.

Fig. 6. DLD measures for the “EIA” data set

Fig. 7. RID measures for the “CENSUS” data set



5.4 Disclosure Risk: Confidence Intervals RID 

For the “CENSUS” data set (Figure 7), we observe: 

— With JPEG, rank swapping and additive noise, RID is higher for outliers 
than for the overall data set. Microaggregation behaves the other way round. 

— Resampling has the highest RID values, which are very close to the maxi- 
mum. 

— Microaggregation always has lower RID values than additive noise.

— As the parameter choice changes, RID oscillates in a very wide range for 
JPEG and rankswapping; for additive noise, the range is only moderately 
wide; microaggregation displays a narrower range for RID. 

For the “EIA” data set (Figure 8), we observe: 

— With all masking methods, outliers present higher RID values than the 
overall data set. This difference is very large for JPEG and additive noise, 
moderate with microaggregation and small with rank swapping. 

— RID values in resampling are extremely high again. 

— The width of the range of RID values as parameters change exhibits the
same behavior, for all methods, as in the “CENSUS” data set.

— Unlike in the previous data set, the behavior of additive noise is better than 
that of microaggregation. 



5.5 Disclosure Risk: Confidence Intervals SDID 

Fig. 8. RID measures for the “EIA” data set

Fig. 9. SDID measures for the “CENSUS” data set

For the “CENSUS” data set (Figure 9), we observe:



— With all methods, except additive noise and some parameterizations of 
JPEG, SDID risk is lower for outliers than for the overall data set. With
additive noise and JPEG with high parameter values, no difference
between the two groups of individuals is perceived.

— Resampling presents the highest SDID disclosure risk. 

— In general, microaggregation presents better results than noise. 




Outlier Protection in Continuous Microdata Masking 



Fig. 10. SDID measures for the “EIA” data set

— JPEG with low and moderate parameter values, rankswapping with medium 
and high parameter values and microaggregation with any parameter present 
similar values on the SDID measure. 

— As the parameter choice changes, the range of SDID for additive noise is 
very wide. Except for high parameter choices of JPEG and low parameter 
choices of rank swapping, this range for JPEG, rankswapping and microag- 
gregation is small. 

For the “EIA” data set (Figure 10), we observe: 

— In the various masking methods, SDID values for outliers and for the overall 
data set compare in a similar way as for the “CENSUS” data set, with the
variation that differences are sharper here. 

— In general, resampling presents the highest SDID disclosure risk again. 

— Also in general, the range of SDID values, depending on the chosen param- 
eter, is very wide for additive noise, moderate for JPEG and rankswapping, 
and very narrow for microaggregation. 



6 Conclusions 

From the results outlined in the previous section, we can enumerate some con- 
clusions of the comparison between the masking methods considered in terms of 
information loss and disclosure risk. Note that our conclusions are based on the 
results obtained with the two studied data sets. However, a dramatic change
in results with other data sets is unlikely. There are two types of conclusions:
general ones, and ones focused on the behavior of outliers with respect to the
overall data set.

— Resampling presents the lowest information loss but also the highest disclosure
risk. This means that this method is not suitable for SDC.

— Additive random noise has an irregular behavior with respect to other methods
when measuring information loss; for instance, it has low IL1s values
in the “CENSUS” data set but high IL1 values in the “EIA” data set. Regarding
disclosure risk measures, additive noise usually yields higher values
than other methods; in addition, there is a problem with outliers, for which 
the risk is higher than for the overall data set. This means that this method 
should be used very carefully, especially when outliers are present that should 
be protected. 

— Measures on JPEG take a very wide range of values depending on the cho- 
sen parameter. When information loss is low, disclosure risk is high, and 
conversely. Thus, this method is not appropriate for high or low parameter 
values. It may be usable for medium parameter values. When measuring disclosure
risk (DLD and RID), one finds that the disclosure risk is higher
for outliers, especially in the “EIA” data set. Care should be exercised when
using this method.

— Rankswapping is a good masking method especially when taking moderately 
high parameter values. For these values of the parameter, disclosure risk 
measures stay low while information loss, particularly IL1, does not increase
significantly. Regarding the behavior of disclosure risk for outliers, we observe 
that outliers incur a lower risk than the overall data set for the DLD and 
SDID measures; for the RID measure, the behavior is the opposite but 
the difference between outliers and the overall data set is not really big. In 
summary, rankswapping offers more protection to outliers, which is positive. 

— Microaggregation is a good masking method as it presents low disclosure risk, 
especially in the “CENSUS” data set, while information loss stays moderate 
(lower in the “EIA” data set). The parameter choice is less influential than 
in rankswapping: the behavior for microaggregation is more robust, even 
though medium to high values are the best parameter choices. Regarding 
outliers, this method behaves similarly to rankswapping, except for DLD in 
the “EIA” data set and RID in the “CENSUS” data set. 

We conclude that the best methods, both in general and in regard to outliers, 
are rankswapping with moderately high parameter values, and microaggregation 
with medium to high parameter values.

Abstract. Statistical agencies often mask (or distort) microdata in public-use
files so that the confidentiality of information associated with individual entities
is preserved. The intent of many of the masking methods is to cause only minor
distortions in some of the distributions of the data and possibly no distortion in
a few aggregate or marginal statistics. In record linkage (as in nearest neighbor
methods), metrics are used to determine how close a value of a variable in a record
is to the value of the corresponding variable in another record. If a sufficient
number of variables in one record have values that are close to values in
another record, then the records may be a match and correspond to the same en- 
tity. This paper shows that it is possible to create metrics for which re- 
identification is straightforward in many situations where masking is currently 
done. We begin by demonstrating how to quickly construct metrics for con- 
tinuous variables that have been micro-aggregated one at a time using conven- 
tional methods. We extend the methods to situations where rank swapping is 
performed and discuss the situation where several continuous variables are mi- 
cro-aggregated simultaneously. We close by indicating how metrics might be 
created for situations of synthetic microdata satisfying several sets of analytic 
constraints. 



1 Introduction 

With the advent of readily available computing power and straightforward software 
packages, many users have requested that significantly more public-use files be made 
available for analyses. To create the public-use files, statistical agencies mask or 
distort confidential data with the intent that records associated with individual entities 
cannot be re-identified using publicly available non-confidential data sources. Agen- 
cies adopted many of the masking methods primarily because they could be easily 
programmed and because other agencies had used the methods. 

The primary intent in producing a public-use file is to allow users to reproduce 
(approximately) certain statistical analyses that might be performed on the original, 
confidential microdata. 

To produce such files, agencies typically need to describe what masking method or 
methods they used, adjustments that users might need during a statistical analysis to 
get results corresponding to results on the confidential microdata, and limitations of 
the distributional characteristics of the data. The masked microdata are often
intended to reproduce some of the distributional characteristics of individual variables
and groups of variables. For instance, if users want to produce a regression analysis
Y = Xβ, the agency may use a masking method that allows this type of regression.
Palley and Simonoff [26] have observed that, if the user has the independent X
variables in a file, then the user may use the beta coefficient β to predict, say, an income
variable Y. For a high-income individual with other known characteristics such as 
age range, race, and sex, the income variable may be sufficient to allow re- 
identification. The value of the Y variable that can be associated with the individual 
might be referred to as a predictive disclosure of confidential information. Alterna- 
tively, if the user (or intruder) has a file with a variable Y’ that corresponds to the Y 
variable and has information about age range, race, and sex of an individual, the in- 
truder may be able to re-identify an individual with the X information in the public- 
use file. 

During the re-identification of (age range, sex, race, Y) with (age range, sex, race, 
Y’), we use a crude metric that states that the first three variables should agree exactly 
and the last variables Y and Y’ should agree approximately. If Y and Y’ are known
to be in the tails of a distribution, as with a high income or an unusual situation,
then we may be able to deduce a range in which Y and Y’ are likely to be
approximately the same. In some situations, we have a crude functional relationship
f(X) = Y that allows us to associate the X variables with a predicted Y variable that 
may be close to a corresponding Y’ variable. In these situations, we can think of the 
functional relationship f(X) = Y and other knowledge as yielding a metric for the 
distance between Y and Y’. The variables Y and Y’ can be thought of as weak identi- 
fiers that allow us to associate a record in the public-use file with one or more records 
in the intruder’s file. The intruder’s file may contain publicly available information
along with an identifier such as name and address. 
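The crude metric described above can be sketched in a few lines of Python. All field names, the example tolerance, and the function names are hypothetical and chosen only for illustration: the three discrete variables must agree exactly, and the two income values must agree to within a relative tolerance.

```python
def crude_match(public_rec, intruder_rec,
                keys=("age_range", "sex", "race"), rel_tol=0.10):
    """True if the discrete keys agree exactly and the incomes agree approximately."""
    if any(public_rec[k] != intruder_rec[k] for k in keys):
        return False
    y, y_prime = public_rec["income"], intruder_rec["income"]
    # Approximate agreement: difference within rel_tol of the larger magnitude.
    return abs(y - y_prime) <= rel_tol * max(abs(y), abs(y_prime))


def candidates(public_rec, intruder_file, **kwargs):
    """All records of the intruder's file that the crude metric links to public_rec."""
    return [rec for rec in intruder_file if crude_match(public_rec, rec, **kwargs)]
```

If `candidates` returns a single record for a high-income individual, the name and address attached to that record in the intruder's file would be a plausible re-identification. A longer list means the weak identifiers were not sufficient on their own.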

Statistical agencies have often evaluated the potential confidentiality of a public- 
use file using elementary methods. For instance, they may have a public-use file with 
a combination (X_d, X_c) of discrete variables X_d and continuous variables X_c that
they wish to associate with a potential intruder file (X_d’, X_c’). In some situations, the
intruder file (X_d’, X_c’) might be the original, unmasked data. To compare records,
they may apply a sort/merge utility that does an exact comparison of (X_d, X_c) with
(X_d’, X_c’). Because continuous variables often show minor variations, the exact
comparison will not identify corresponding records. In making the comparison, the
agency may be assuming that the intruder only has continuous and other data that do
not correspond exactly to the agency’s original data or to the masked data. Additionally,
the agency may often assume that the intruder has a subset of the data (X_d’, X_c’)
to compare with (X_d, X_c). A complication associated with the assumption that the
intruder has a subset of the data (X_d’, X_c’) is that it often only takes a subset of the
data to identify a small proportion of the records. Additionally, as noted above, the
intruder may have knowledge of the analyses that can be performed using the data
(X_d, X_c) and additional data sources that allow him to create a larger subset of the
variables in (X_d’, X_c’).

More sophisticated record linkage and other methods have been developed in the 
computer science literature. The newer re-identification methods were developed for
linking administrative lists (Winkler [42]) in which names and addresses were not of
sufficient quality for accurate linkages. Individuals use other identifiers such as geo- 
graphic identifiers (when available), numeric data such as income and mortgage 
payments that are often available in commonly used public databases, and functional 
relationships between variables (Scheuren and Winkler [32]). Much more sophisticated
link analysis methods (McCallum and Wellner [20]; Bilenko et al. [1]) are
currently being developed for linking information from large numbers of web pages
nearly automatically. Further, if users have fairly sophisticated knowledge of how 
the masking is done and of the analyses that might be done with a file, they can use 
additional, ad hoc, file- and analysis-specific methods to re-identify (Lambert [19]). 

This paper demonstrates how to quickly construct new metrics associated with in- 
dividual variables and groups of variables and add them to re-identification software. 
The construction of the more sophisticated metrics for linking administrative lists and 
link analyses has become increasingly straightforward. Covering the most
advanced methods is beyond the scope of this paper. We begin by showing how to 
construct metrics to re-identify in situations where variables in files are micro- 
aggregated. It is straightforward to extend to situations where moderately high pro- 
portions can be re-identified even when sampling proportions are on the order of 
0.1% and upwards of 30% of the variables have errors in them [44], [23]. 

A fundamental concept is that, if the users of the microdata have accurate informa- 
tion about the underlying distributions of variables in confidential files, then they may 
have information sufficient for re-identification of some of the records. If the statisti- 
cal agency needs to evaluate the confidentiality of a potential public-use file, then it 
needs to use more sophisticated re-identification methods than those that have some- 
times been used previously. If a small proportion of records in the potential public- 
use file can be re-identified, then it may be possible to apply additional masking pro- 
cedures to coarsen the file. The coarsening is intended to reduce or eliminate re- 
identification while preserving many of the analytic properties of the masked file. 
Kim and Winkler [17] determined that a small proportion of records in a file masked 
using additive noise only might be re-identified. To better assure confidentiality of 
the public-use file, they applied an additional masking procedure in which they 
swapped information in a set of specified subdomains. After assuring that the first 
and second masking procedures assured disclosure avoidance, they released the pub- 
lic-use file. 

In ordinary single-ranking micro-aggregation, we sort each individual variable and 
aggregate the values of each variable into groups of size k. In each group, we replace 
the individual values by an aggregate such as a median or the average. Typically, k is 
taken to be 3 or 4. If k is greater than 4, then various researchers have shown that
basic analytic properties such as regression can be seriously affected. Our key idea is 
that micro-aggregation almost precisely tells us the underlying distributions of indi- 
vidual variables. It is straightforward to construct highly optimal metrics based on 
the reported micro-aggregates in the public-use file. If two or three variables are 
uncorrelated, then the metrics associated with them may allow re-identification. If 
there is sampling at very low proportions and only a moderate proportion of variables 
(say 3 of 10) have severe distortion, then the redundancy due to the accurate variables
can overcome the inaccuracy of the remaining variables. 
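
The single-ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any agency: the function name `microaggregate`, the choice of the mean as the aggregate, and the rule of merging a short final group into the previous one are all assumptions made for the example.

```python
from statistics import mean

def microaggregate(values, k=3):
    """Sort one variable, group it into runs of k, replace each value by its group mean.

    The result is aligned with the input order. A short final run is merged
    into the previous group so every group has at least k members (assumed rule).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    out = [0.0] * len(values)
    for group in groups:
        aggregate = mean(values[i] for i in group)
        for i in group:
            out[i] = aggregate
    return out

# Hypothetical income values for seven records.
incomes = [12.0, 95.0, 14.0, 13.0, 90.0, 100.0, 40.0]
masked = microaggregate(incomes, k=3)
```

The distinct values in `masked` are exactly the published aggregates, which is why the masked file reveals so much about the distribution of each variable.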

In developing additional metrics for other masking situations, we make the key as- 
sumption that the public-use file is created in a manner that allows one or two sets of 
analyses. If the public-use file cannot be demonstrably proven to have analytic prop- 
erties, then the distributions of the masked variables might not allow us to construct 
highly specific metrics that are optimized for re-identification in a specific set of files. 
For instance, in an extreme situation, we might replace the value of each continuous 
variable with its average value over the entire file. If the discrete variables associated 
with a record by themselves do not allow re-identification, then we would not be able 
to re-identify using the combination of discrete and continuous variables. In this 
extreme situation, it is unlikely that the analytic properties of the microdata file would 
be any more useful than the tabulations in the tables in existing publications. 

The outline of this paper is as follows. In the second section, we give additional 
background on the identification methods that are currently being used in computer 
science. With background in the methods, it becomes straightforward to understand 
the construction of metrics, the use of redundant information, and the notions of dis- 
tance between objects (records) taken from a variety of sources. In the general com- 
puter science setting of matching administrative and other lists, the interest is on iden- 
tification rates above 90%. For potential masked, public-use files, we only need the 
much more attainable 0.1-2% re-identification rates to determine that the masking is 
not sufficient to release the file to the public. In the third section, we present a sum- 
mary of how to construct the metrics that can be added to software to enhance re- 
identification. We also go into some detail about the underlying concepts to provide 
background on how straightforward it is to create metrics for other situations. In the 
fourth section, we show how to construct metrics for rank swapping and indicate how 
they might be constructed in other situations. In the fifth section, we provide ideas 
related to what might be expected in terms of re-identification with synthetic data. 
On the surface, it is intuitive that synthetic data is artificial and does not correspond 
to any individual entity. Fienberg [10], among others, has indicated that if sufficient 
analytic restraints are placed on the synthetic data, then some (approximate) re- 
identification might be possible. We provide ideas on how we could associate a small 
proportion of the records in a synthetic database with specific individuals. Based on 
the linkages, we show how the information in the synthetic data would allow us to 
deduce confidential information about an individual entity. In the sixth section, we 
provide additional caveats by connecting the methods of this paper with some of the 
ideas and results in the literature. The final section consists of concluding remarks. 



2 Background 

With increasing demand from users, statistical agencies are creating more and more 
masked, public-use files that can be analyzed. Agencies seldom demonstrate that the 
public-use files can be used for analyses that correspond reasonably well to analyses 
that might be done with the original, non-public data. Users assume that the masked,
public-use files can be used for many analyses. They also assume that records in the
files cannot be re-identified with individuals using publicly available non-confidential 
files. 

Over the last few years, there has been remarkable progress, primarily in the com- 
puter science literature, in linking records associated with individual entities (either 
persons or businesses) from a variety of sources. Some of the ideas originated in 
record linkage (see e.g. [32]) where many identifiers such as name, address, age, and 
income can have substantial errors across files. With economic variables such as 
income from one source file and receipts from another file, Scheuren and Winkler 
[32] have shown how to significantly increase accuracy in linkages of administrative 
lists. The correlations between the variables allowed Scheuren and Winkler [32] to 
create additional weak identifiers called predictors. Straightforward metrics associ- 
ated with the predictors brought records from one file closer to smaller subsets of 
other records in another file in comparison to the situations in which the predictors 
and associated metrics were not used. Using ideas that were independently suggested 
by economists and record linkage practitioners at Statistics Canada, Winkler [43] has 
shown how to use auxiliary population files to improve linkage accuracy. For
instance, assume that the population file contains a set of variables $Z_1, Z_2, \ldots$ that are
each contained in one of the files being matched but not in both simultaneously. If a
record from one file is associated with several records in the second file, then the Z- 
variables may reduce the association to 1 or 0 records in the second file. 

Using Probabilistic Relational Models, Getoor et al. [12], Taskar et al. [33], [34],
[35], [36], and Koller and Pfeffer [18] have shown how to systematically and
iteratively improve linkages in a set of files. In extreme situations, Torra [38], [39] has
shown how to create aggregates from quantitative and other data that can be used for 
linkages. Further, McCallum and Wellner [20] and others have shown how to use 
Markov Random Fields and Graph Partitioning algorithms to systematically increase 
the likelihood of a set of linkages between corresponding records in a group of files. 
The latter methods are often used for extracting and linking information (entities or 
objects) from a group of web pages. 

In this paper we concentrate on the most elementary of the methods that corre- 
spond (roughly) to distance between records in a metric space such as might be used 
in nearest neighbor matching. Record linkage methods use metrics that scale the 
ranges of variables automatically and partially account for dependencies between 
variables (Winkler [42]). We show the validity of the re-identification ideas for 
masking methods that are known to produce analytically valid micro-data in a few 
situations. Because micro-aggregation methods are often used by statistical agencies, 
we begin by demonstrating how to compute re-identification metrics for micro- 
aggregation and show how to create analogous metrics for other situations. If we 
understand how easily masked, public-use records can be re-identified using the ele- 
mentary methods that are being increasingly applied, then it may be possible to de- 
velop better masking strategies. We do not cover details of the new, more advanced 
methods. The advanced methods might be applied in situations where the more ele- 
mentary methods do not allow accurate re-identification. Their application might 
lead to even better protection strategies. 


3 Re-identification for Single-Ranking Micro-aggregation

In this section we provide a summary of the basic re-identification ideas given by 
Winkler [44]. Muralidhar [23] independently verified the methods. Both Muralidhar 
and one of his graduate students were each independently able to verify the ease in 
constructing highly optimized re-identification metrics and their efficacy in re- 
identification with micro-aggregated data. We provide additional intuition on the 
underlying concepts so that the extensions and related ideas in subsequent sections 
can be better understood. 

We consider a rectangular data base (table) having fields (variables) $X_i$, $i = 1, \ldots, n$,
and value states $x_{ij}$. In many microdata confidentiality experiments, users
want 10 or more variables $X_i$. We assume that each of the variables $X_i$ is continuous,
skewed, and not taking zero value states. The second assumption eliminates a few 
additional technical details. It can easily be eliminated. The third assumption is for 
convenience. It is not generally needed for the arguments that follow. 

We begin our discussion by considering databases with 1000 or more records and 
situations in which micro-aggregation is on one variable at a time. In this discussion, 
we demonstrate that micro-aggregation as currently practiced allows almost perfect 
re-identification with existing record linkage procedures even when k is greater than 
or equal to 10. We can easily develop nearest-neighbor methods with similar metrics 
that have almost 100 percent re-identification rates. 

We chose any three variables, say $X_1$, $X_2$, and $X_3$, that are pairwise uncorrelated
($R^2 < 0.2$). Our procedure is for aggregating variables one at a time. Although we use 3
variables in the following description, re-identification may occur with only two or
with four or more variables in analogous other situations. Within each variable, sort
the values and aggregate into groups of size 3 or more. Let the new micro-aggregated
value-states be denoted by $a_i(x) = y_{ij}$, $j = 1, \ldots, L_i$, $i = 1, 2, 3$, where $a_i(\cdot)$
is the aggregation function. Each (aggregated) value state is assumed three or
more times (3 or more records have the same value of the y-variables). Most aggregates
will be from three value-states only. In the following, $y_{ij_i}$ will denote the $j_i$-th
value-state of micro-aggregated variable $Y_i$. The micro-aggregated value will be a value
such as the average or median. Such a value is in the range of the values being micro-aggregated.
We develop new record linkage metrics (or nearest neighbor metrics) as follows. The
metrics are for matching a micro-aggregated record R with the original set of data
records. Let $R = (y_{1j_1}, y_{2j_2}, y_{3j_3}) = (a_1(x_{1j_1}), a_2(x_{2j_2}), a_3(x_{3j_3}))$, where the y’s are
values aggregated by the aggregation operators $a_i(\cdot)$ from original values x’s. Using the sort
ordering for individual variables, for each i, let $p(y_{ij_i})$ be the predecessor of $y_{ij_i}$ and
$s(y_{ij_i})$ be the successor of $y_{ij_i}$. In each situation, the predecessor and the successor are
distinct from the value $y_{ij_i}$. For $y_{ij_i}$, let the distance metric $\mathrm{dist}(x, y_{ij_i})$ be 1 if x is
within distance $\min(|y_{ij_i} - p(y_{ij_i})|,\ |y_{ij_i} - s(y_{ij_i})|)/2$ of $y_{ij_i}$; 0, otherwise.
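
A minimal Python sketch of this metric for a single variable follows. It is only an illustration under stated assumptions: the name `build_metric` is hypothetical, and at either end of the distribution the sketch simply uses the one neighbour that exists, which is one possible reading of a one-sided adjustment.

```python
def build_metric(aggregates):
    """Build dist(x, y) for one variable from the published micro-aggregates.

    dist(x, y) is 1 when x lies within half the distance from y to its nearest
    distinct neighbour (predecessor or successor) in sorted order, else 0.
    """
    levels = sorted(set(aggregates))
    radius = {}
    for pos, y in enumerate(levels):
        gaps = []
        if pos > 0:
            gaps.append(y - levels[pos - 1])      # distance to predecessor p(y)
        if pos < len(levels) - 1:
            gaps.append(levels[pos + 1] - y)      # distance to successor s(y)
        # Assumed one-sided handling: use whichever neighbour exists.
        radius[y] = min(gaps) / 2 if gaps else float("inf")

    def dist(x, y):
        return 1 if abs(x - y) <= radius[y] else 0

    return dist

# Hypothetical published aggregates for one variable.
dist = build_metric([13.0, 13.0, 13.0, 81.25, 81.25, 81.25, 81.25])
```

Note that, following the text, a value of 1 indicates agreement between an original value and an aggregate, so the function acts as an agreement indicator rather than a distance in the usual sense.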

This metric allows us to match the X-variables in the original file with the Y-values
in the micro-aggregated file. The metric is highly optimized. It is based on the
distribution of the micro-aggregated variables in the public-use file. Each X-variable in
the intruder file will be associated with at most one Y-variable in the public-use file.
Suitable adjustments should be made for being at the end of the distributions (i.e.,
one-sided). Let N be the number of records in the original database. Then micro-aggregated
record R has probability close to one of matching with its true corresponding
original record. The probability is at least $(N-3)/N$ on each field $X_i$ that we
match with its corresponding micro-aggregated field $Y_i$. It has probability close to
zero of matching with any record other than its original corresponding record on each 
field. Based on independent empirical work, the possibility of re-identification with 
single-ranking micro-aggregation has been observed by Domingo et al. [7], [9]. The 
procedure described above provides a systematic method of re-identifying in all situa- 
tions where micro-aggregation is used. 

The extensions to cover more general (and realistic) situations are straightforward 
(Winkler [44]). The intruder who may be using public use data will often have name, 
address, and other identifying information that can be associated with individual 
records. The intruder will only use the quantitative data in his files to re-identify
records according to the names and other identifying information in his files. If sampling
fractions as low as 0.001 are used, then metrics can still be constructed that
allow us to separate a moderate proportion of records in the public-use sample and
associate them with records in the intruder database. If a moderate number of variables among a
group of ten or more variables have severe error (above thirty percent), then we may 
still be able to use the remaining, more accurate variables to link records. 

We repeat some of the key concepts regarding re-identification to improve the in- 
tuitive understanding. If we know that we have two overlapping populations, then it 
may only take two or three variables to re-identify a proportion of the records. Each 
variable is a weak identifier that allows us to associate the record in the public-use 
file with a subset of the records in the intruder data files. Each variable allows us to 
associate the record with a different subset of records. Record linkage procedures
(or more crudely nearest neighbor procedures) allow us to take an efficient intersec- 
tion of the records in the second file that might be related to the record in the first file. 
In a number of situations, this procedure allows us to re-identify a proportion of the 
records in the masked file with high probability. The record linkage methods are 
good (efficient) at automatically accounting for the redundancy in a set of agreeing 
variables (see e.g., [17], [45]). 
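
The intersection idea can be sketched as follows. This is a crude nearest-neighbour style stand-in, not a record linkage procedure: the function name `candidates`, the fixed per-variable tolerances, and the sample records are all hypothetical.

```python
def candidates(masked_record, intruder_records, tolerances):
    """Return indices of intruder records that agree with masked_record on every variable."""
    keep = set(range(len(intruder_records)))
    for var, tol in enumerate(tolerances):
        agree = {i for i, rec in enumerate(intruder_records)
                 if abs(rec[var] - masked_record[var]) <= tol}
        keep &= agree  # each weak identifier narrows the candidate set
    return sorted(keep)

# Hypothetical intruder file with three quantitative variables per record.
intruder = [(10, 200, 5), (11, 900, 6), (50, 210, 5), (12, 205, 40)]
masked = (11, 204, 5.5)
linked = candidates(masked, intruder, tolerances=(2, 10, 1))
```

In this toy example each variable on its own leaves three candidates, while the intersection over all three leaves a single record.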



4 Re-identification for Other Basic Masking Methods 

In this section, we describe possible extensions of the metric-construction procedures 
to rank swapping. We provide issues related to constructing metrics for data in which 
micro-aggregation is by several variables simultaneously and in which additive noise 
is used. 

Rank swapping (Moore [22]) has somewhat similar characteristics to micro- 
aggregation. We begin by sorting individual continuous variables in the file, say, in 
decreasing order. An a priori rank-swapped range p is chosen in which each value of
each variable in a record is swapped with the corresponding value of the
same variable in another randomly chosen record that is within p% of the ordered
range of the first record. The proportion p is typically between 15 and 20%. If the 
number of records in the file is even, then each record is swapped once. If the num- 
ber of records is odd, then one additional record will be swapped twice to assure that 
the value in the extra record is swapped once. At the very end of the swapping, re- 
cords (those records with the smallest values) cannot be swapped the full p% of the 
range. 
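
A simplified sketch of rank swapping for one variable is given below. It is an illustration under assumptions rather than Moore's exact procedure: the name `rank_swap` is hypothetical, partners are chosen only among later ranks that have not yet been swapped, and a record with no available partner is left unchanged instead of being swapped a second time.

```python
import random

def rank_swap(values, p=15.0, seed=0):
    """Swap each value with another value whose rank is within p% of its own rank."""
    rng = random.Random(seed)
    n = len(values)
    window = max(1, int(n * p / 100))
    # Ranks are taken in decreasing order of value, as in the text.
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    out = list(values)
    swapped = [False] * n  # indexed by rank
    for r in range(n):
        if swapped[r]:
            continue
        partners = [q for q in range(r + 1, min(n, r + window + 1)) if not swapped[q]]
        if not partners:
            continue  # end of the file: no unswapped record left within range
        q = rng.choice(partners)
        i, j = order[r], order[q]
        out[i], out[j] = values[j], values[i]
        swapped[r] = swapped[q] = True
    return out

# Hypothetical variable with 40 records; p = 15 gives a window of 6 ranks.
original = [float(v) for v in range(1, 41)]
swapped_values = rank_swap(original, p=15.0, seed=1)
```

The multiset of values is unchanged by construction, so univariate statistics are preserved while the association between variables within a record is perturbed.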

If the rank-swapping procedure were repeated over different sets of random num- 
bers, then on average the replacement value for a given record would be the average 
of the records in the p% range of the records. As with micro-aggregation, if the value 
p is above 1% and the number of records in the p% range is above 10, then the ana- 
lytic distortions in the resultant data can be very severe. This is particularly true on a 
subdomain in which the rank-swapping procedure exchanges values of records in the 
subdomain with values of records in other subdomains.

Even in extreme situations, we will be able to re-identify. If p is equal to 1% and the
number of records in the p% range is over 100, each value of a variable allows us to 
construct a metric in which a given record can be associated with at most 1-2% of the 
other records. This is similar to the micro-aggregation situation. Each value of a 
variable in a record is a weak identifier that allows us to tentatively link the record to 
1-2% and tentatively not link the record to 98-99% of the records in the file. As we
accumulate potential linkages over several of the weak-identifiers (variables), we can 
link a moderate or small proportion of the records with reasonable confidence (prob- 
ability above 0.5). 
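
As a rough worked illustration of how the weak identifiers accumulate (an idealized sketch that is not part of the original analysis), assume a file of N records, assume that each of m variables independently links a masked record to a fraction f of the file, and ignore sampling and reporting error. The values N = 10,000 and f = 0.02 below are assumptions chosen only for the arithmetic.

```latex
E[\text{false candidates}] \approx (N - 1)\, f^{m},
\qquad
N = 10{,}000,\; f = 0.02:\quad
m = 1 \Rightarrow \approx 200,\quad
m = 2 \Rightarrow \approx 4,\quad
m = 3 \Rightarrow \approx 0.08 .
```

Correlated variables, sampling, and errors in the intruder's data all weaken the independence assumption, which is consistent with the more cautious expectation that only a moderate or small proportion of records can be linked with reasonable confidence.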

If we are matching the masked file with the original unmasked file (i.e., no sampling)
and we assume that the original values do not have severe (more than 30%)
distortion, then, with three uncorrelated variables and the newly constructed metrics,
it seems likely that we will be able to re-identify a moderate to high proportion of 
records. If there is sampling with small proportions and there is a substantial number 
of variables (say 10) in which a small proportion of variables have severe errors, then 
it seems likely that we will still be able to re-identify a moderate or small proportion 
of the records. The re-identification proportion may be less than the corresponding 
proportions in the micro-aggregation situation because the rank-swapping optimized 
metrics are used in a much greater range of values (i.e., much larger k) than in the 
micro-aggregation situation. 

Domingo-Ferrer and Mateo-Sanz [6] have shown how to micro-aggregate using
two or more variables simultaneously. As they show, analytic properties of the 
masked data degrade much more rapidly than in single-variable micro-aggregation. 
The degradation is intuitive because if we simultaneously micro-aggregate on three 
uncorrelated variables, then the resultant aggregates of the three variables are unlikely 
to preserve correlations among themselves and with other continuous data in the files. 
A crude analogy is if we use k-means to cluster data and then micro-aggregate within 
clusters. If we micro-aggregate using more than three variables simultaneously, then 
the degradation of analytic properties is likely to be greater than in the three-variable
situation. In many situations, it seems likely the multi-variable micro-aggregation
procedure will preserve confidentiality. The confidentiality is due to k-anonymity
[30], [31] because each masked record is likely to be associated with at
least k records.

At present, we are uncertain how to compute highly specific re-identification met- 
rics for additive noise (Kim [15], Fuller [11]) or mixtures of additive noise (Yancey 
et al. [45]). Yancey et al. [45] showed that mixtures of additive noise provided a ten- 
fold reduction in re-identification rates in comparison to additive noise (Kim and 
Winkler [17]) while not seriously compromising analytic properties. Brand [2], in a 
nice tutorial paper, has given more details on how additive noise can compromise the 
analytic properties of the masked data. Her ideas might be used to determine addi- 
tional re-identification metrics. Recent work (Kargupta et al. [14]) suggests how bet- 
ter re-identification metrics for additive noise might be constructed. Because Kar- 
gupta et al. [14] makes use of ideas from the signal processing and random matrix 
literature, it may target data situations with considerably less inherent noise and varia- 
tion than survey data. The Domingo et al. [8] density-estimation procedure is in- 
tended to estimate a reconstructed probability distribution of the original, unmasked 
data. It may suffer from the curse-of-dimensionality problems where the amount of 
data needed in multivariate situations grows at an exceptionally high exponential rate. 
It is not clear that the Kargupta et al. [14] or the Domingo et al. [8] procedures can 
deal with mixtures of additive noise. Even in the situations using additive noise, 
more research is needed to determine whether the methods of Kargupta et al. [14] or 
Domingo et al. [8] would yield re-identification rates higher than those obtained
by Kim and Winkler [17]. 



5 Synthetic Data 

Palley and Simonoff [26], Fienberg [10], and Reiter [29] have pointed out that
synthetic data may still contain some records that allow re-identification
of confidential information. Fienberg [10] has given additional methods
of re-identification that can be used with either original data that has been masked
or synthetic data. Individuals create synthetic data from models that preserve some of 
the distributional assumptions of the original, confidential data and allow a few analy- 
ses that correspond roughly to the analyses that might be performed on the original, 
unmasked data. An outlier or value in the tail of a distribution in a record in the 
synthetic data may be much closer to the record in the original data and the corresponding
record data available to the intruder than to other information. The intuition
is that, when much of the synthetic data is in the interior of a distribution, an intruder
can only determine that the synthetic record corresponds to k > 3 records in the
intruder’s file. The outlier may allow the intruder to determine that the synthetic record
is likely to correspond to at most one record in the intruder’s file. 
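
The contrast between interior values and outliers can be illustrated with a small Python sketch. The function name `matching_records`, the 10% relative tolerance, and the income figures are hypothetical and serve only to show the counting argument.

```python
def matching_records(synthetic_value, intruder_values, rel_tol=0.1):
    """Return indices of intruder values within a relative tolerance of a synthetic value."""
    return [i for i, v in enumerate(intruder_values)
            if abs(v - synthetic_value) <= rel_tol * abs(synthetic_value)]

# Hypothetical intruder file: five typical incomes and one value in the tail.
intruder_income = [31_000, 32_500, 30_200, 33_000, 29_800, 410_000]

interior = matching_records(31_500, intruder_income)   # synthetic value in the interior
outlier = matching_records(395_000, intruder_income)   # synthetic value in the tail
```

Here the interior synthetic value is compatible with five intruder records, while the synthetic value in the tail is compatible with only one.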

In this section, we describe a situation in which some of the synthetic data might 
be re-identified (in terms of yielding values in fields or variables) that are reasonably 
close to confidential values of those fields and can be associated with names and 
other identifiers in the intruder’s file. We make several assumptions that have been 
made by others who have provided methods for generating analytically valid synthetic
microdata. The first is that the original microdata is free of frame errors from
undercoverage and duplication and free of edit/imputation errors. The second is that 
the model (or set of distributions) that is based on the original microdata allows a 
reasonable number of analyses in the synthetic data that correspond to analyses on the 
original microdata. The data producer describes the distributional assumptions and 
the possible limitations of any analyses so that users of the synthetic data can perform 
valid analyses within the limitations of the synthetic data. 

The re-identification metrics we might use are determined by the set of plausible 
probability distributions that correspond to the synthetic data. Each distribution can 
determine one or more metrics. The simplest situation is that described by Fienberg 
[10] for synthetic data and Lambert [19] for any data. If a value is an outlier in a 
distribution, then there may only be one plausible value in a record of the intruder
that corresponds to that value of the record in the masked file. The identifying infor- 
mation in an intruder’s file can be used to compromise even the synthetic data. In 
this discussion, we use the term outlier to represent a value of a variable that is in the 
tail of a distribution. If the synthetic data allows more and more analyses, it will have 
correspondingly more distributions and metrics that can be used to determine more
outliers. Each of the outliers in the distributions of the synthetic data may yield
re-identifications. As shown by Palley and Simonoff [26] and Fienberg [10], it is possible
that information from the non-outliers in the synthetic data and aggregate characteristics
of the population, such as what types of individual entities are in the population, may
yield information for further improving the re-identification of outliers. A class of
examples for which this is true is the class of regression relationships that give good 
predictive power (i.e., low variance in this situation) of a given variable such as in- 
come when the values of other variables and the coefficients of a valid regression 
relationship are known. 

To provide further insight, we provide more detailed examples. The first example is
where we produce synthetic data for which only one very simple set of analyses 
might be performed and for which re-identification is highly unlikely. The example 
builds intuition on why valid distributional properties in the synthetic data are neces- 
sary for building re-identification metrics. The example corresponds roughly to the 
ideas of Kim and Winkler [17]. We have data (X, S) where X is continuous micro- 
data corresponding to information such as income and mortgage and S is discrete 
corresponding to information such as age, race, and sex. The potential users of the 
data specify that they wish to perform regression analyses on the data with the emphasis
on the subdomains specified by the S-variables. We obtain the means and covariances
of the X variables on each of the subdomains determined by S. We generate
synthetic data Y such that the means and covariances of the Y-variables on the
subdomains correspond to the means and covariances of the X-variables on the same 
subdomains. Sample sizes in each subdomain are taken to be at least 500 because 
covariances of the Y variables do not stabilize to values of covariances of X until 
sample size is sufficiently large. The slow stabilization is due to the nature of gener- 
ating multivariate random variables satisfying a number of analytic restraints. 
Because there are an exceptionally large number of ways to generate the Y-variables,
it is intuitive that in many situations there is no chance that the outliers in
the Y distribution will correspond closely with outliers in the X distribution with high 
probability. The exception is when the X-variables have multivariate normal 
distribution, the Y-variables have multivariate normal distribution and the number of 
dimensions (variables) n is increased (Paas [25], Mera [21]). We observe that users
of the synthetic data will be able to reproduce (approximately) regression analyses on 
some of the subdomains. The data, however, are virtually useless because they cannot
even be used for regression on the entire file (independent of the S variables) or for
examination of simple statistics such as rank correlations. If we put additional restraints
on the generated Y variables such that they preserve regressions on some of the
subdomains obtained by aggregating the basic subdomains from the S variables and preserve
a few of the rank correlations, then it is likely that we will need to have a considerably
larger sample size in each of the subdomains determined by S and that the set
of analytic restraints will yield some outliers in the Y data that lead to re- 
identification. 
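
The moment-matching step in the first example (synthetic Y whose means and covariances agree with those of X on a subdomain) can be sketched for two variables in plain Python. This is one of many possible constructions and is not claimed to be the procedure of Kim and Winkler [17]: the helper names `cov2`, `chol2`, and `synthesize`, the use of Gaussian draws, and the sample data are assumptions. In practice the function would be applied separately within each subdomain determined by S.

```python
import random
from statistics import mean

def cov2(xs, ys):
    """Sample means and covariance terms (mean_x, mean_y, var_x, cov_xy, var_y)."""
    n, mx, my = len(xs), mean(xs), mean(ys)
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return mx, my, a, b, c

def chol2(a, b, c):
    """Lower-triangular Cholesky factor (l11, l21, l22) of [[a, b], [b, c]]."""
    l11 = a ** 0.5
    l21 = b / l11
    return l11, l21, (c - l21 ** 2) ** 0.5

def synthesize(xs, ys, seed=0):
    """Draw synthetic pairs whose sample means and covariances match those of (xs, ys)."""
    rng = random.Random(seed)
    n = len(xs)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rng.gauss(0, 1) for _ in range(n)]
    m1, m2, a, b, c = cov2(z1, z2)
    k11, k21, k22 = chol2(a, b, c)
    # Whiten the draws so their sample covariance is the identity.
    w1 = [(u - m1) / k11 for u in z1]
    w2 = [((v - m2) - k21 * w) / k22 for v, w in zip(z2, w1)]
    # Colour with the target covariance and shift to the target means.
    mx, my, a, b, c = cov2(xs, ys)
    l11, l21, l22 = chol2(a, b, c)
    y1 = [mx + l11 * w for w in w1]
    y2 = [my + l21 * w + l22 * v for w, v in zip(w1, w2)]
    return y1, y2

# Hypothetical subdomain with two continuous variables (e.g. income and mortgage).
xs = [30.0, 42.0, 55.0, 61.0, 48.0, 39.0, 70.0, 52.0]
ys = [100.0, 150.0, 210.0, 260.0, 170.0, 120.0, 300.0, 190.0]
y1, y2 = synthesize(xs, ys, seed=0)
```

Changing `seed` yields a different synthetic file with the same first and second moments, which illustrates why outliers in Y need not correspond to outliers in X when only these moments are constrained.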

We need to better understand how valid analytic relationships, including certain 
aggregates such as regression coefficients and covariances of variables can yield 
predictive ability with properly constructed metrics that correspond to valid analyses. 
To do this we need to give an overview about how one creates a model for data. In 
the following, we will use model, valid parametric form, and distribution to mean the 
same thing. If we generate synthetic data from the valid model, then the synthetic 
data will satisfy one or two analytic properties of the original data. 

Although there are a number of good examples of the modeling process, we prefer 
Reiter [29] because it is representative of and builds on a number of good ideas intro- 
duced in earlier work. Using the general multiple-imputation framework of Raghuna- 
than et al. [28], Reiter shows how to create a model through a systematic set of re- 
gressions (or imputation models) that use subsets of variables to predict other 
variables. The key component of the modeling procedure is the estimation of the 
conditional and joint probabilities associated with the data and the proposed analysis. 
Although Reiter’s method targets multiple-imputation, we could also use maximum 
entropy methods (Polettini [27]), multivariate density estimation (Domingo et al. [8]),
or Latin Hypercube methods (Dandekar et al. [4], [5]). 

The models have the effect of estimating the probability distribution of a variable 
conditional on specific values of the (independent) predictor variables. The estimated 
distributions can serve as predictors of the value of a variable given the values of the 
variables upon which it is conditioned. The estimated distributions can be used as 
new metrics for re-identification. As Reiter observes: “When there are parameters 
with small estimated variances, imputers can check for predictive disclosures and, if 
necessary, use coarser imputation models.” Alternatively, we might state this as “If 
the model allows predictive values that might lead to re-identification, then we might 
reduce the analytic effectiveness of the model to reduce predictive disclosure risk.” 

What we can observe is that the Reiter example (and earlier examples due to 
Palley and Simonoff [26], Lambert [19], and Fienberg [10]) uses one variable (possi- 
bly in the tail of the distribution or in a suitably narrowed range) to obtain predictive 
disclosure. If we use a substantial number of variables in the model and potential
predictive disclosures are possible with different subsets of them, then it is possible 
that predictive disclosure can increase if suitable metrics are created and placed in 
record linkage software. This is a research problem. 



6 Discussion 

In the earlier part of the paper, we demonstrated how to quickly create metrics that 
allow re-identification with widely used masking procedures such as micro- 
aggregation. A key feature was that micro-aggregation, as it is typically applied, 
gives exceptionally good information about the distribution of each individual vari- 
able. If the distributional information is used to create a set of metrics for a set of 
variables, then high rates of re-identification are quite possible. This is particularly 
true if there are a moderate number of continuous variables that are pairwise uncorrelated
or only moderately correlated.

If synthetic data is produced through a valid parametric modeling procedure, then 
we suspect that re-identification rates using single variables may be in the range 
0.001-0.01. Our low estimate of re-identification rates is based on the
re-identification rates with mixtures of additive noise (Yancey et al. [45]). With mixtures
of additive noise, re-identification at rates of 0.001 to 0.01 occurs primarily with 
outliers in the tails of distributions. Although determination of general re- 
identification rates with different types of synthetic data is a research problem, we 
would expect the re-identification rates to be relatively low. Rates this low may be 
sufficient to assure confidentiality. Palley and Simonoff [26], Lambert [19], Fienberg 
[10], and Reiter [29] have all given examples specific to different types of files and 
types of analyses that make plausible conjectures with respect to predictive disclosure
using one variable at a time. If we use a substantial number of variables and
have valid information about a model (i.e. probability distribution) representing them, 
then it seems likely that we can construct metrics that increase re-identification rates. 
In the simpler situations where re-identification might occur, Reiter [29] suggests 
coarsening the models to reduce predictive disclosures. 

A straightforward procedure may be for the data producer to perform a direct re- 
identification between the synthetic data and the original data used in the modeling. 
This can quickly identify potential records in the synthetic data that may lead to predictive
disclosure or identity disclosure. Kim and Winkler [17] delineate those records
that were most at risk of re-identification when additive noise was used. Their
coarsening procedure was to swap information in the at-risk records with the not-at- 
risk records. As noted by [17], the coarsening procedure had the effect of reducing a 
number of the analytic properties of the public-use data. Alternatively, Yancey et al. 
[45] used mixtures of additive noise that reduced disclosure risk by a factor of 10 
while not significantly reducing the analytic properties in the masked file. 

Dandekar et al. [4], [5], Grim et al. [13], and Thibaudeau and Winkler [37] have 
all given methods for generating synthetic microdata that do not involve as much 
detailed modeling effort as those presented by Reiter [29]. The Latin-Hypercube
methods of Dandekar et al. [4], [5] may represent a practical alternative. The
methods of Grim et al. [13] and Thibaudeau and Winkler [37] may use approximations
that compromise many of the analytic properties. 



7 Concluding Remarks 

This paper provides methods for constructing re-identification metrics that can be 
used with a few of the masking methods that are commonly used by statistical agen- 
cies for producing public-use files. We concentrate on a few masking methods that 
are known to produce files with one or two analytic properties that correspond to data 
in unmasked confidential files. 

Disclaimer: This report is released to inform interested parties of research and to
encourage discussion. The views expressed are those of the authors and not
necessarily those of the U.S. Census Bureau. The author thanks Dr. Nancy Gordon,
Dr. Cynthia Clark, and two reviewers for comments leading to improved wording and
explanation and to several additional references.



Abstract. This paper provides an overview of methods of masking microdata
so that the data can be placed in public-use files. It divides the methods accord- 
ing to whether they have been demonstrated to provide analytic properties or 
not. For those methods that have been shown to provide one or two sets of ana- 
lytic properties in the masked data, we indicate where the data may have limita- 
tions for most analyses and how re-identification might or can be performed. 
We cover several methods for producing synthetic data and possible computa- 
tional extensions for better automating the creation of the underlying statistical 
models. We finish by providing background on analysis-specific and general in- 
formation-loss metrics to stimulate research. 



1 Introduction 

This paper presents an overview of methods for masking microdata. Statistical
agencies mask data to create public-use files for analyses that cannot be performed
with published tables and related results. In creating the public-use files, the intent is
to produce data that might allow individuals to approximately reproduce one or two
analyses that might be performed on the original, confidential microdata. Masking methods
are often chosen because they are straightforward to implement rather than because 
they produce analytically valid data. 

There are several issues related to the production of microdata. First, if the public- 
use file is created, then the agency should demonstrate that one or more analyses are 
possible with the microdata. It may be that the best the agency can do is an ad hoc 
justification for a particular analysis. This may be sufficient to meet the needs of 
users. Alternatively, the agency may be able to refer to specific justifications that 
have been given for similar methods on similar files in previous papers or research 
reports. If methods such as global recoding, local suppression, and micro-aggregation 
have never been rigorously justified, then the agency should consider justifying the 
validity of a method. This is true even if it is in widespread use or in readily
available generalized software. Second, the public-use file should be demonstrated to
be confidential because it does not allow the re-identification of information
associated with individual entities.

The paper provides background on the validity of masked microdata files and the
possibility of re-identifying information using public-use files and public,
non-confidential microdata. Over the years, considerable research has yielded better
methods and models for public-use data that have analytic properties corresponding to the
original, confidential microdata and for evaluating risk to avoid disclosure of con- 
fidential information. In the second section, we provide an elementary framework in 
which we can address the issues. We list and describe some of the methods that are in 
common use for producing confidential microdata. In the third section, we go into 
detail about some of the analytic properties of various masking methods. A method is 
analytically valid if it can produce masked data that can be used for a few analyses
that roughly correspond to analyses that might have been done with the original, 
confidential microdata. A masking method is analytically interesting if it can produce 
files that have a moderate number of variables (say twelve) and allows two or more 
types of analyses on a set of subdomains. In the fourth section, we give an overview 
of re-identification using methods such as record linkage and link analysis that are 
well-known in the computer science literature. In the fifth section, we provide an 
overview of research in information-loss metrics and re-identification methods.
Although there are some objective information loss metrics (Domingo-Ferrer
and Mateo-Sanz [16], Domingo-Ferrer [15], Duncan et al. [20], Raghunathan et al. 
[48]), the metrics do not always relate to specific analyses that users may perform on 
the public-use files. There is substantial need for developing information-loss metrics 
that can be used in a variety of analytic situations. Key issues with disclosure avoid- 
ance are the improved methods of re-identification associated with linking adminis- 
trative files and the high quality of information in publicly available files. In some 
situations, the increased amount of publicly available files means that manual meth- 
ods (Malin et al. [39]) might be used for re-identification. To further improve disclo- 
sure-avoidance methods, we need to research some of the key issues in re- 
identification. The final section consists of concluding remarks. 



2 Background 

This section gives a framework that is often used in disclosure avoidance research 
and brief descriptions of a variety of methods that are in use for masking microdata. 
Other specific issues related to some of the masking procedures are covered in subse- 
quent sections. 

The framework is as follows. An agency (producer of public-use microdata) may 
begin with data X consisting of both discrete and continuous variables. The agency 
applies a masking procedure (some are listed below) to produce data Y. The masking 
procedure is intended to reduce or eliminate re-identification and provide a number of 
the analytic properties that users of the data have indicated that they need. The agency 
might create data Y and evaluate how well it preserves a few analytic properties and 
then perform a re-identification experiment. A conservative re-identification
experiment might match data Y directly with data X. Because the data Y correspond almost
precisely to the data X (in respects to be made clearer later), some records in X may be
re-identified. To avoid disclosure, the agency might apply additional masking
procedures to data X to create data Y'. It might also extrapolate or investigate how well a
potential intruder could re-identify using data Y'' that contain a subset of the
variables in X and that contain minor or major distortions in some of the variables. After
(possibly) several iterations in which the agency determines that disclosure is avoided 
and some of the analytic properties are preserved, the agency might release the data. 
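
The conservative experiment in this framework can be sketched in a few lines. The following is only an illustration under stated assumptions: records are numeric tuples, record i of Y was derived from record i of X, matching is by nearest neighbor under squared Euclidean distance (a stand-in for full record linkage), and the function name and toy noise levels are hypothetical.

```python
import random

def nearest_neighbor_reid_rate(original, masked):
    """Fraction of masked records whose nearest original record
    (squared Euclidean distance) is the record they were derived from.
    Assumes masked[i] was produced from original[i]."""
    hits = 0
    for i, y in enumerate(masked):
        best = min(
            range(len(original)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(original[j], y)),
        )
        hits += (best == i)
    return hits / len(masked)

# Toy data X and two toy maskings Y (independent noise, chosen for illustration).
rng = random.Random(0)
X = [(rng.gauss(50, 10), rng.gauss(30, 5)) for _ in range(200)]
Y_light = [(a + rng.gauss(0, 0.1), b + rng.gauss(0, 0.1)) for a, b in X]
Y_heavy = [(a + rng.gauss(0, 20), b + rng.gauss(0, 10)) for a, b in X]

print("light masking:", nearest_neighbor_reid_rate(X, Y_light))
print("heavy masking:", nearest_neighbor_reid_rate(X, Y_heavy))
```

In an iterative release process, the agency would repeat such a check after each additional masking step and stop when the rate is acceptably low and the required analytic properties still hold.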

Global Recoding and Local Suppression are covered by Willenborg and De Waal 
[66]. Global recoding refers to various global aggregations of identifiers so that re- 
identification is more difficult. For instance, the geographic identifiers associated 
with national US data for 50 States might be aggregated into four regions. Local
suppression covers the situation when a group of variables might be used in
re-identifying a record. The values of one or more of the variables would be blanked or
set to defaults so that the combination of variables cannot be used for
re-identification. In each situation, the provider of the data might set a default k (say 3 or
4) on the minimum number of records that must agree after global recoding and
local suppression. 
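
A minimal sketch of the two steps, assuming records are (state, age group, sex) tuples. The state-to-region table is a hypothetical fragment, and the single suppression pass shown here does not by itself guarantee that every remaining combination reaches the threshold k; in practice the check would be repeated.

```python
from collections import Counter

# Hypothetical state-to-region mapping (only a few states, for illustration).
REGION = {"NY": "Northeast", "MA": "Northeast", "TX": "South", "FL": "South",
          "CA": "West", "WA": "West", "IL": "Midwest", "OH": "Midwest"}

def global_recode(records):
    """Global recoding: replace the state in each record by its region."""
    return [(REGION[state], age, sex) for state, age, sex in records]

def local_suppress(records, k=3):
    """Local suppression: blank the age group (None) in every record whose
    combination of values occurs fewer than k times."""
    counts = Counter(records)
    return [r if counts[r] >= k else (r[0], None, r[2]) for r in records]

def smallest_group(records):
    """Size of the rarest combination of identifying values."""
    return min(Counter(records).values())

records = [("NY", "30-39", "F"), ("MA", "30-39", "F"), ("NY", "30-39", "F"),
           ("TX", "40-49", "M"), ("FL", "40-49", "M"), ("TX", "40-49", "M"),
           ("CA", "20-29", "F")]
recoded = global_recode(records)
released = local_suppress(recoded, k=3)
print(smallest_group(records), smallest_group(recoded), released[-1])
```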

Swapping (Dalenius and Reiss [9], Reiss [49], Schlorer [57]) refers to a method of 
swapping information from one record to another. In some situations, a subset of the 
variables is swapped. In variants, information from all records, a purposively chosen 
subset of records, or a randomly selected subset of records may be swapped. The 
purposively chosen sample of records may be chosen because they are believed to 
have a greater risk of re-identification. The advantages of swapping are that it is eas- 
ily implemented and it is one of the best methods of preserving confidentiality. Its 
main disadvantage is that, even with a very low swapping rate, it can destroy analytic 
properties, particularly on subdomains. 

Rank Swapping (Moore [41]) is another easily implemented masking procedure. With 
basic single-variable rank swapping, the values of an individual variable are sorted 
and swapped in a range of k percent of the total range. A randomization determines
the specific values of variables that are swapped. Swapping is typically without re- 
placement. The procedure is repeated for each variable until all variables have been 
rank swapped. If k is relatively small, then analytic distortions on the entire file may 
be small (Domingo-Ferrer and Mateo-Sanz [16], Moore [41]) for simple regression 
analyses. If k is relatively large, there is an assumption that the re-identification risk 
may be reduced. 
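
Single-variable rank swapping might be sketched as follows. This is an assumption-laden illustration: the swapping range is measured in rank positions (p percent of the number of records) rather than in units of the variable, swapping is without replacement, and the function name and parameters are hypothetical.

```python
import random

def rank_swap(values, p=5.0, seed=0):
    """Swap each value with a randomly chosen, not-yet-swapped value whose
    rank is at most p percent of the number of records away."""
    rng = random.Random(seed)
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # record indices by rank
    window = max(1, int(n * p / 100))
    masked = list(values)
    used = [False] * n  # swapping is without replacement
    for r in range(n):
        if used[r]:
            continue
        candidates = [s for s in range(r + 1, min(n, r + window + 1))
                      if not used[s]]
        if not candidates:
            continue  # no partner left in range; the value stays as it is
        s = rng.choice(candidates)
        i, j = order[r], order[s]
        masked[i], masked[j] = values[j], values[i]
        used[r] = used[s] = True
    return masked

incomes = list(range(100))
swapped = rank_swap(incomes, p=5.0)
print(swapped[:10])
```

Because values are only exchanged between records, the marginal distribution of the variable is unchanged; what is perturbed is its relationship to the other variables in each record. For a file with several variables the function would be applied to each variable in turn.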

Micro-aggregation (Defays and Anwar [12], Domingo-Ferrer and Mateo-Sanz [17])
is a method of aggregating values of variables that is intended to reduce re- 
identification risk. For single-ranking micro-aggregation in which each variable is 
aggregated independently of other variables, it is easily implemented. The values of a 
variable are sorted and divided into groups of size k. In practice k is taken to be 3 or 4 
to reduce analytic distortions. In each group, the values of the variable are replaced 
by an aggregate such as the mean or the median. The micro-aggregation is repeated 
for each of the variables that are considered to be usable for re-identification.
Domingo-Ferrer and Mateo-Sanz [17] provided methods for aggregating several
variables simultaneously. The methods can be based on multi-variable metrics for
clustering records into the most similar groups. They are not as easily implemented
because they can involve sophisticated optimization algorithms. For computational
efficiency, the methods are applied to 2-4 variables simultaneously whereas many 
public-use files contain 12 or more variables. The advantage of the multi-variable 
aggregation method is that it provides better protection against re-identification. Its 
disadvantage is that analytic properties can be severely compromised, particularly if 
two or three uncorrelated variables are used in the aggregation. The variables that are 
not micro-aggregated may themselves allow re-identification. 
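
The single-ranking variant for one variable can be sketched as below. The handling of a short final group (folding it into the preceding group so that no group has fewer than k values) is an assumption of this sketch, as are the function name and the choice of the mean as the aggregate.

```python
from statistics import mean

def microaggregate(values, k=3):
    """Sort the values, cut them into consecutive groups of size k, and
    replace every value by the mean of its group."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    masked = [0.0] * n
    start = 0
    while start < n:
        # If fewer than 2k values remain, put all of them in the last group.
        end = n if n - start < 2 * k else start + k
        group = order[start:end]
        avg = mean(values[i] for i in group)
        for i in group:
            masked[i] = avg
        start = end
    return masked

print(microaggregate([1, 12, 3, 10, 2, 11], k=3))
```

Applying the function separately to each identifying variable gives the independent, variable-by-variable aggregation described above; the multi-variable methods would instead form the groups from several variables at once.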

Additive Noise was introduced by Kim [32], [33] and investigated by Fuller [28], Kim
and Winkler [34], and Yancey et al. [71]. Let X be an n×k data matrix consisting of n
records with k variables (fields). If we generate random noise ε with mean 0 and
cov(ε) = c cov(X) for 0 < c < 0.5, then we can use Y = X + ε in place of X.
Further, we can do a linear transform of Y to Z so that the mean of Z equals the
mean of X and the correlation of Z equals the correlation of X. Kim [26] also showed
that it is theoretically possible to recover means and covariances of X on arbitrary 
subdomains. Kim [32], [33] and Fuller [28] both showed that the additive-noise pro- 
cedures provide good analytic properties such as regression analyses with masked 
data Z that closely reproduce regression analyses of the unmasked data X. Because 
additive noise can yield files with moderate re-identification rates (Fuller [28], Kim
and Winkler [34]), Roque [54] introduced methods that applied mixtures of additive
noise that reduced re-identification rates by a factor of 10. Yancey et al. [71] provided
much simpler computational procedures that eliminated the nonlinear optimization
procedures of Roque [54]. One criticism has been the need of specialized software for
producing and analyzing the data that is (partially) alleviated by high quality software 
that was developed for Yancey et al. [71]. 
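
For two variables, basic additive noise can be sketched with the standard library alone. The sketch assumes normally distributed noise with covariance c times the sample covariance of X, generated through a hand-written 2×2 Cholesky factor; it does not implement the subsequent linear transform to Z or the mixture methods, and all names are hypothetical.

```python
import math
import random
from statistics import mean

def cov2(data):
    """Sample covariance entries (s11, s12, s22) of two-column data."""
    n = len(data)
    m1 = mean(r[0] for r in data)
    m2 = mean(r[1] for r in data)
    s11 = sum((r[0] - m1) ** 2 for r in data) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in data) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in data) / (n - 1)
    return s11, s12, s22

def add_noise(data, c=0.1, seed=0):
    """Return Y = X + e with e ~ N(0, c * cov(X)) for two-column data X."""
    rng = random.Random(seed)
    s11, s12, s22 = cov2(data)
    # Cholesky factor of cov(X): [[l11, 0], [l21, l22]].
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    scale = math.sqrt(c)
    masked = []
    for x1, x2 in data:
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        masked.append((x1 + scale * l11 * z1,
                       x2 + scale * (l21 * z1 + l22 * z2)))
    return masked

rng = random.Random(1)
X = []
for _ in range(2000):
    x1 = rng.gauss(50, 10)
    X.append((x1, 0.5 * x1 + rng.gauss(0, 5)))
Y = add_noise(X, c=0.25)
print(cov2(X))
print(cov2(Y))  # each entry should be roughly (1 + c) times that of X
```

Since cov(Y) is approximately (1 + c) cov(X), correlations and regression slopes are approximately preserved, which is consistent with the analytic properties reported for the method.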

Synthetic Microdata from Probabilistic Models (Fienberg et al. [24], [26], [27], 
Raghunathan et al. [48], Reiter [52], Little and Liu [37], [38], and Polettini [46]) 
provide methods for building an accurate statistical model M of data X. The statistical 
model M is generally based on estimates of joint and conditional distributions of the 
underlying probability densities. The estimation of densities is simplified because the 
models typically only target one set of analyses. Artificial or synthetic data Y are 
created by randomly drawing records from the model M. The intent is that some of 
the analytic properties of M are preserved in the artificial data Y. It is generally as- 
sumed that re-identification is impossible even when a number of analytic restraints 
are placed on the data. A clear disadvantage of synthetic data is the amount of model- 
ing expertise that is needed for developing a reasonable model M of the data with 
suitable analytic properties. Another disadvantage is that a moderate to large amount 
of high quality data may be needed in the modeling. If there are errors in the original 
data, then there is a possibility that the errors will be approximately reproduced in the 
synthetic data. An advantage is that the one or two potential analytic uses of the data 
can be clearly described.



3 Analytic Properties of Microdata

There are several issues with regards to the production of microdata with analytic 
properties. The first issue is whether the analytic properties of the masked data are 
justified. Does the masked microdata allow a user to approximately reproduce one or 
two analyses that could be performed on the original microdata other than simple 
tabulations (that are often in published tables)? Is there sufficient detail so that the 
user can be aware of any potential discrepancies that may occur in an analysis that the 
producer has stated can be performed with the masked microdata? The second issue 
is whether the public-use microdata yields analytic results that differ significantly 
from results that would be obtained using the original microdata. For instance, in an 
econometric analysis, an economist might not find relationships between certain vari- 
ables where relationships exist between the variables in the original microdata. Addi- 
tionally, there may be too many or too few individuals in high income categories in 
comparison with the original microdata. 

With the exception of additive noise and synthetic microdata, there has seldom 
been much work that describes the analytic properties of the masked microdata. Gen- 
erally, additive noise produces masked microdata that can quite accurately reproduce 
regression analyses (even on arbitrary subdomains that are reasonably large). If the 
constant c used with additive noise is close to zero, then additive noise will often not 
distort the original data much and allow other statistical analyses. A deficiency of 
additive noise as suggested by Fuller [28] and shown by Kim and Winkler [34] is that 
small proportions of records (0.5% - 3%) might be re-identified. Kim and Winkler 
[34] applied additional masking procedures to avoid disclosure, which somewhat
decreased the valid analytic properties of the public-use microdata. We observe that it is
possible that re-identification rates go up as more analytic restrictions are placed on 
the masked data. For instance, many users want to do analyses on subdomains. If the 
microdata have detailed geographic identifiers (Elliot et al. [21], [22]) or subdomains 
associated with sparse age-race-sex categories are created, then re-identification rates 
can increase. 

Creation of synthetic data from valid models is appealing because it potentially can 
preserve some analytic properties. It is assumed that synthetic data cannot be re- 
identified. An intuitive difficulty of synthetic data, as shown by Reiter [52], [50], is 
that any analytic properties that are not included in the model will not be in the syn- 
thetic data. Reiter [52] has shown that some of the analytic properties that are in the 
model may not be in the synthetic microdata due to problems with the original micro- 
data. For instance, the nature of the random number generation process and sample 
sizes, the type of modeling procedures used, outliers, and errors in the original micro- 
data can all affect the quality of the produced microdata. Additionally, it is not possi- 
ble to have greater accuracy for analytic purposes in the masked microdata than in the 
original, confidential microdata. 
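
The dependence of synthetic data on the model can be illustrated with a deliberately small discrete model. The sketch assumes records of two categorical variables (a, b) and a model M consisting of the relative-frequency estimates of P(a) and P(b | a); it is not the modeling procedure of any of the cited papers, and the function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit_model(records):
    """Estimate P(a) and P(b | a) from (a, b) pairs by relative frequencies."""
    marginal = Counter(a for a, _ in records)
    conditional = defaultdict(Counter)
    for a, b in records:
        conditional[a][b] += 1
    return marginal, conditional

def draw_synthetic(model, n, seed=0):
    """Draw n synthetic (a, b) records from the fitted model."""
    rng = random.Random(seed)
    marginal, conditional = model
    synthetic = []
    for _ in range(n):
        a = rng.choices(list(marginal), weights=list(marginal.values()))[0]
        b = rng.choices(list(conditional[a]),
                        weights=list(conditional[a].values()))[0]
        synthetic.append((a, b))
    return synthetic

X = [("urban", "high"), ("urban", "high"), ("urban", "low"),
     ("rural", "low"), ("rural", "low"), ("rural", "low")]
Y = draw_synthetic(fit_model(X), 50)
print(Counter(Y))
```

Because this model conditions b on a, the association between the two variables carries over to Y. If the model instead drew b from its marginal distribution alone, the association would be absent from the synthetic data even though it is present in X.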

Reiter [51], [52] has used standard modeling methods that are often used in the 
multiple imputation literature for creating models for the data. Polettini [46] has used 
maximum entropy methods for creating models of the data. Muralidhar et al. [42], 
[43] and Sarathy et al. [55] have used copulas to model distributions. Due to the
difficulty (both theoretically and computationally) of creating models for data,
Thibaudeau and Winkler [62] have suggested using Bayesian Networks because high
quality software is available for (semi-)automatically creating models of the data. The 
Bayesian Network methods can be used only with discrete data or with models in 
which some of the continuous data can be reasonably approximated by a discrete data 
representation. Although the models are likely to have lower quality than those of 
Reiter [52] and Polettini [46], they are much easier for the statistical agencies to im- 
plement. High quality software is often available for creating the Bayesian Network 
[62]. Dandekar et al. [10], [11] use Latin Hypercube methods for creating synthetic 
data. They have demonstrated that the methods can rapidly produce large synthetic 
data sets with a considerable number of variables. Further, when the Latin Hypercube 
methods do not produce synthetic data with the desired accuracy for some analyses, 
they have iterative refinement methods for improving the analytic properties. 

With other masking methods for producing microdata, there is often no justifica- 
tion that the masking method produces masked microdata with one or two of the 
analytic properties of the original microdata. The superficial exceptions are when the 
producers of the masked microdata demonstrate that the masked microdata repro- 
duces several of the tabulations of the original microdata. The producers may also 
include reproduction of tabulations on a few subdomains. At a minimum, we would 
hope that the masked microdata allows reproducing tabulations on the entire file and 
on certain subdomains. Fuller [28] and Lambert [35] have indicated that masked 
microdata should preserve the first two moments of the original, confidential micro- 
data and at least one other statistical property. Van Den Hout and Van Der Heijden 
[64] have shown how a few analytic properties can be preserved with the PRAM 
method (see e.g., Willenborg and De Waal [66]) for producing masked, discrete data. 
PRAM uses a Markovian strategy for swapping values of a variable across various 
records. The intent is that certain marginal distributions are approximately preserved. 
A potential intruder can never be certain what values in a given record have been 
changed. 
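
The Markovian strategy can be sketched for a single categorical variable. The transition probabilities below are hypothetical and chosen only for illustration; how they should be chosen so that particular marginal distributions are approximately preserved is not addressed here.

```python
import random

def pram(values, transition, seed=0):
    """Replace each value v by a category drawn from the row transition[v],
    where transition[v][w] is the probability that v is released as w."""
    rng = random.Random(seed)
    released = []
    for v in values:
        categories = list(transition[v])
        weights = [transition[v][w] for w in categories]
        released.append(rng.choices(categories, weights=weights)[0])
    return released

# Hypothetical transition matrix: each value is kept with probability 0.9.
T = {"employed":   {"employed": 0.9, "unemployed": 0.1},
     "unemployed": {"employed": 0.1, "unemployed": 0.9}}
status = ["employed"] * 8 + ["unemployed"] * 2
print(pram(status, T))
```

An intruder who sees a released value cannot tell whether it is the original or a changed value, while an analyst who knows the transition matrix can, in principle, correct tabulations for the perturbation.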



4 Re-identification of Microdata 

The highest standard for estimating the proportion of records that can be re-identified 
(Lambert [35], Domingo-Ferrer and Torra [19]) is where record linkage is used to 
match information from public data with the masked microdata. Alternative effective 
matching methods are the nearest neighbor methods (Domingo-Ferrer and Mateo- 
Sanz [16]) and the clustering algorithms of Bacher et al. [5]. Record linkage can also 
be used in the conservative framework described in section 2 in which a masked file 
of Y data is matched directly with the original, confidential file of X data that was 
used in creating it. In some situations, it may be possible to accurately match 0.5-2% 
of the records in the initial iteration of the potential masked microdata. In this situa- 
tion, the file is non-confidential and we would proceed to further masking to provide 
disclosure avoidance. The ability to link public information to public-use files has 
further been compromised by the increased sophistication of record linkage (Yancey
et al. [71]) and link analysis methods (McCallum and Wellner [40], Bilenko et al. [7])
and the increased availability of high-quality public data. Indeed, due to the quality of 
public information, Malin et al. [39] are often able to match public information using 
purely manual methods. 

Re-identification of microdata refers to the ability to use publicly available infor- 
mation to attach names, addresses, and other (semi-)unique identifiers to individual 
records in a public-use file. An identifier is semi-unique if it cannot exactly identify a 
linkage between two records using the identifier by itself but can allow a re- 
identification in combination with other variables. For instance, in a file where there 
are many individuals with the name “John Smith”, “John Smith” is a semi-unique 
identifier. US Postal ZIP code and income may be other semi-unique identifiers for 
an individual. When the semi-unique identifiers such as name “John Smith”, ZIP 
code, and income are used in combination, they may uniquely determine a linkage 
between two records. Among the records having name “John Smith” and ZIP code 
“98100,” we may define a metric that allows the income to be matched if the incomes 
are within 5% of each other. The deviation in the metrics can often be modified ac- 
cording to the types of files being matched, the amount of redundant information that 
is available for matching, and the stated analytic properties of the public-use
microdata. For instance, if we had an additional variable such as profession, it is likely we
could re-identify even when the metrics allow significantly larger deviations in in- 
come and other characteristics between matching records. 
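
The combination of semi-unique identifiers can be sketched as a simple matching rule. The record layout (dictionaries with name, zip, and income), the function name, and the toy values are assumptions of this illustration; the 5% tolerance follows the example in the text.

```python
def candidate_matches(file_a, file_b, tol=0.05):
    """Pairs (i, j) that agree exactly on name and ZIP code and whose
    incomes differ by at most tol relative to the larger income."""
    pairs = []
    for i, a in enumerate(file_a):
        for j, b in enumerate(file_b):
            same_key = (a["name"], a["zip"]) == (b["name"], b["zip"])
            close = (abs(a["income"] - b["income"])
                     <= tol * max(a["income"], b["income"]))
            if same_key and close:
                pairs.append((i, j))
    return pairs

file_a = [{"name": "John Smith", "zip": "98100", "income": 50000},
          {"name": "John Smith", "zip": "98100", "income": 80000}]
file_b = [{"name": "John Smith", "zip": "98100", "income": 51000},
          {"name": "John Smith", "zip": "98100", "income": 120000}]

print(candidate_matches(file_a, file_b, tol=0.05))
print(candidate_matches(file_a, file_b, tol=0.50))
```

With the tight tolerance only one pair survives, so name, ZIP code, and income together determine a linkage. With the loose tolerance several pairs remain, and an additional variable such as profession would be needed to separate them.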

In record linkage (Winkler [67], [68]) and in re-identification experiments (Yancey 
et al. [71]), the software is designed to deal with both minor and major errors in a 
subset of the variables used in matching. In most realistic real-world settings, some of 
the continuous variables are available in external files and can be used in matching. 
Further, it is straightforward (Scheuren and Winkler [56]) to define new metrics 
based on relationships between two correlated or related variables to improve match- 
ing accuracy significantly. For instance, we might consider the situation of linking 
two administrative lists of companies in which name and address information are 
sufficiently poor so most companies in one file might be associated with upwards of 
fifteen companies in another file. If we also have income on one file, receipts on the other
file, and a crude function ‘f(income) = receipts’ that relates income to receipts, then 
we may be able to significantly increase high quality match rates with the extra, non- 
name-and-address information. In the situation with a masked file Y, we may have a 
file Y' that only has a subset of the variables that are in the Y file and the subset of Y'
variables is not sufficient for accurate matching. If we can find an extra file Z in
which the Z values can be used to predict some of the Y variables that are missing from the
Y' file, then we may have sufficient extra information for significantly improving the 
matching (and possible re-identification) [69]. 
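
The use of a crude functional relationship to separate candidates can be sketched as follows. The linear form of f and the ratio of 3.0 are purely hypothetical; any rough predictor of receipts from income would play the same role.

```python
def predicted_receipts(income, ratio=3.0):
    """Hypothetical crude relationship f(income) = receipts."""
    return ratio * income

def best_candidate(income, candidate_receipts, ratio=3.0):
    """Index of the candidate whose receipts are closest to f(income)."""
    target = predicted_receipts(income, ratio)
    return min(range(len(candidate_receipts)),
               key=lambda j: abs(candidate_receipts[j] - target))

# One company on the first file and three name-and-address candidates
# on the second file (toy values).
print(best_candidate(100.0, [50.0, 310.0, 900.0]))
```

Here name and address alone leave three candidates, and the predicted receipts single out the second one.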

Because many individuals are unable to perform sophisticated or elementary re- 
cord linkage, a number of other re-identification risk measures have been defined. 
The first measure is the number or proportion of population uniques in a file. The 
measure appears to be based on elementary ideas in survey sampling for which 
sort/merge utilities can be used to determine matches on an exact character-by- 
character basis. A unique is a record with identifying information that distinguishes it
from other records. If records are tabulated according to their identifying information,
then uniques are those records that have frequency one. There are elementary models 
that use the frequency distribution of the identifying characteristics in the sample to 
obtain estimates of the number of the population uniques in the sample. The intuition 
is that each population unique might be identified using elementary sort/merge meth- 
ods. A key observation (Elliot et al. [21], [22]) is that a sizable proportion of popula- 
tion uniques in a file with twelve or more variables may be uniquely determined by 
three, four, or five variables in the record associated with the unique. These special 
uniques may be more easily re-identified because the intruder may be more likely to 
have the smaller number of variables in an external file. 
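
Counting uniques by tabulation, and finding records that are already unique on a small subset of variables, can be sketched as below. The function names are hypothetical, records are assumed to be tuples of categorical values, and the exhaustive search over subsets is only practical for small numbers of variables.

```python
from collections import Counter
from itertools import combinations

def uniques(records, cols):
    """Indices of records whose values on the columns `cols` occur once."""
    keys = [tuple(r[c] for c in cols) for r in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] == 1]

def special_uniques(records, n_cols, size):
    """Indices of records that are unique on at least one subset of
    `size` variables out of the first n_cols columns."""
    found = set()
    for cols in combinations(range(n_cols), size):
        found.update(uniques(records, cols))
    return sorted(found)

records = [("M", "30", "A"), ("M", "30", "B"),
           ("F", "30", "A"), ("F", "40", "A")]
print(uniques(records, (0, 1, 2)))     # unique on all three variables
print(uniques(records, (0, 1)))        # unique on the first two variables
print(special_uniques(records, 3, 1))  # unique on a single variable
```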

There is a class of risk-estimation measures that are based on statistical models in 
which the number of uniques in the public-use sample is used to provide an estimate 
of the number of uniques in the original population file. The methods apply quite 
sophisticated models that relate the distribution of the sample uniques to the distribu- 
tion of the population uniques. Early papers with these statistical risk measures are by 
Bethlehem et al. [6], Fienberg and Makov [25], Benedetti and Franconi [5] and
Skinner and Holmes [59]. Later papers with enhanced methods are due to Skinner and
Elliot [58], Rinott [53], and Polettini and Stander [47]. The apparent intent is to pro- 
vide a straightforward, rapid method of estimating the proportion of sample uniques 
that are also population uniques. We will refer to these methods as
sample-unique-population-unique (SUPU) methods.

There are four obvious criticisms of these SUPU methods. The first criticism is 
that it is straightforward for the statistical agency to determine the risk of disclosure 
by directly comparing the sample file with the population file from which it is pro- 
duced. Indeed, determining the number of population uniques that are also in the 
sample file can be part of the sampling procedure. The second criticism is that a pub- 
lic-use file will typically contain ten or more variables, both discrete and continuous. 
Even if the continuous variables are broken into a large number of discrete ranges, it 
is likely that all or almost all of the records in the sample and many or most of the 
original population records will be unique. The methods that have typically been 
applied only use a few discrete variables under the assumption that the intruder will 
only have access to a few discrete variables. This assumption appears to be naive 
given the amount of information that is available from internet and other sources 
(Sweeney [60], [61]). Because most if not all public-use data files contain continuous 
variables, it is unrealistic not to include them in the statistical models of disclosure 
risk. The third criticism is that the statistical models only provide an estimate of the 
proportion of sample uniques that are also population uniques. The methods do not 
determine which of the sample uniques can be re-identified as would happen in a 
record linkage experiment (Kim and Winkler [34], Sweeney [60], Winkler [68], Do- 
mingo-Ferrer and Torra [19]). The fourth criticism is that the models are often se- 
verely biased with the bias varying according to the data source. For instance, if we 
were to produce a public-use file according to the following two procedures, we 
would obtain severely biased answers from all of the SUPU models. In the first
situation, we could sample only from population uniques in producing public-use file D1.
In the second situation, we would sample only from population records that are not
unique (occurring two or more times according to identifying information) in such a
manner that every record in the public-use file is a sample unique. In each of these
situations, every SUPU model would give very biased estimates of the re- 
identification risk. In general situations where different sampling procedures might be 
used, we would still likely have subsets of the public-use file where the biases could 
approach the two extreme situations previously described. 

A second metric associated with re-identification (disclosure) risk for a
one-variable situation is the multiplicative inverse of the conditional variance Var(X)
(Duncan et al. [20]) where the variance is that of an intruder with weak knowledge.
For instance, if data are micro-aggregated into a group of n items and each item is
given the average value x̄, the re-identification risk is n / ((n - 1) σ²) where σ² is the
variance of the original x-values. Trottini and Fienberg [63] have extended this metric
to a few two-dimensional situations and have noted that the metric is different from 
many of the other metrics that are commonly in use. 

In addition to using methods such as record linkage or nearest-neighbor matching, 
Lambert [35] and Palley and Simonoff [45] have shown how knowledge of the ana- 
lytic properties (such as regression parameters) of a masked data file can allow re- 
identification. De Waal and Willenborg [13], [14] show how knowledge of sampling 
weights can allow re-identification. If we have detailed information about the survey 
frame, the sampling design, and the valid uses of the sampled data, then the sampling 
weights give us useful information about subdomains in which a record may occur. If 
the sampling weights are combined with the original sampled data or with masked 
data, we may have sufficient information to allow some re-identifications. 

There are several research questions. In many instances, can methods such as re- 
cord linkage, nearest neighbor, or clustering be used for re-identification? What data 
will be available on public sources such as the internet and can be used for re- 
identification? Can the ideas of constructing functional relationships be substantially 
extended to allow re-identification in many situations? How can advanced methods 
such as link analysis be applied for re-identification? 



5 Discussion of Information Loss Metrics 

The best situation for the producer of public-use microdata is when there is one set of 
users of the microdata with explicitly stated analytic needs. The key point is that, if 
there is a set of clearly defined users, the data provider can use ad hoc procedures that 
are specific to a small set of required types of analyses. Generally, individuals have 
desired more objective information loss metrics that allow comparison of results 
across several types of analyses or types of files. Before describing some of the at- 
tempts at objective information-loss metrics, we describe the situation with clearly 
user-defined needs. 

Kim and Winkler [34] had to produce masked data that preserved analyses in a
clearly defined set of subdomains determined by age, race, sex, and one other variable.
In their situation, they as producers were able to iteratively negotiate the potential
analytic uses. Because some of the initial domains were too small, they were able to get
users to agree to some collapsing of ages into age ranges and of some of the subdomains
determined by race-age-sex categories. Based on the consultation with users, they
initially created subdomains that had a sufficient number of records. They then applied
additive noise with various levels of noise to determine the analytic degradations due 
to increased noise and determined a noise-level at the user-specified level of accu- 
racy. Finally, they applied a swapping strategy for the most easily re-identified re- 
cords. The swapping strategy was designed to preserve regression properties in the 
user-specified subdomains. After completion of the masking procedures, they pro- 
vided information about analytic degradations in some subdomains that were not 
considered as important as the main set of subdomains and described specifics of the 
testing to determine that disclosure was avoided. 

Willenborg and De Waal [66] have suggested using entropy with discrete data.
The possible intuition is that the masked data may have decreased entropy and be less
useful. We observe, however, that if $X = (x_1, \ldots, x_n)$ is discrete data, then a collapsing
strategy may produce masked data $Y = (y_1, \ldots, y_m)$ where the sum of the counts in each
of the y-cells exceeds a lower bound (say 3) and where the sum of the counts agrees with
the sum of the counts of the x-cells. Each of the y-variables is obtained by aggregating
x-variables. Although entropy clearly decreases, it is not clear how a loglinear analysis
might be affected. In the best situation, the collapsing may account for the sufficient
statistics in the original x-variables and the y-variables allow reproduction of an analysis.
Domingo-Ferrer and Torra [18] have observed additional difficulties with the use of
entropy. In other situations, a loglinear analysis that is possible on the x-variables is
impossible on the y-variables. In other words, entropy is unlikely to provide useful
information about possible degradations in analyses. Iyengar [30] provides an example
of a collapsing strategy that allows a classification problem to be (approximately
and accurately) reproduced. Sweeney [60], [61] provides additional details on
collapsing methods and strategies for producing k-anonymity. A file is k-anonymous if
the identifiers of each record agree exactly with the identifiers in at least k-1 other records.
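
The k-anonymity condition can be checked mechanically. The following is a minimal sketch, assuming records are held as dictionaries and that the caller names the identifying variables; the function name and the toy records are hypothetical and are not taken from Sweeney [60], [61].

```python
from collections import Counter

def is_k_anonymous(records, identifiers, k):
    """Return True if every combination of identifier values that occurs
    in `records` is shared by at least k records."""
    counts = Counter(tuple(r[v] for v in identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical toy file: age range and sex act as the identifiers.
records = [
    {"age": "30-39", "sex": "F", "income": 41000},
    {"age": "30-39", "sex": "F", "income": 52000},
    {"age": "40-49", "sex": "M", "income": 38000},
    {"age": "40-49", "sex": "M", "income": 61000},
    {"age": "40-49", "sex": "M", "income": 45000},
]

two_anonymous = is_k_anonymous(records, ["age", "sex"], 2)    # smallest group has 2 records
three_anonymous = is_k_anonymous(records, ["age", "sex"], 3)  # the F / 30-39 group is too small
```

Collapsing categories (for example, merging the two age ranges) enlarges the groups and is one way such a file could be made 3-anonymous.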

Willenborg and De Waal [66] have suggested using as an information loss metric a
statistic that they refer to as the variance inflation statistic

$\mathrm{Var}(\hat{\theta} \mid D_0) / \mathrm{Var}(\hat{\theta} \mid D)$, where $\hat{\theta}$ is a univariate statistic, $D_0$ is the original data,
and $D$ is the masked data. We observe that if the data are masked via micro-aggregation,
then this statistic increases. If the data are masked by additive noise,
then this statistic decreases.

Duncan et al. [20] have suggested using as information loss metric the statistic
$1/D$, where $D$ is the utility of a univariate statistic. For instance, if a file containing
one variable $X = (x_1, \ldots, x_n)$ is masked by replacing the original values $x_j$ in each of
the records with the average $\bar{x}$, then the variance of the data is the utility $D$. If the
variance increases, then the utility decreases. The ideas of Duncan et al. [20] are
intended to cover both information loss and disclosure risk for a file containing one
variable. Trottini and Fienberg [63] have shown how to extend their ideas to a few
two-dimensional situations with multivariate normal data.

Agrawal and Aggarwal [2] provide a measure of information loss that may be suitable
for further research. Let $\hat{f}_X(x)$ be the density estimated from the masked data
Y. Let $f_X(x)$ be the density from the original data X. Agrawal and Aggarwal [2]
estimate $\hat{f}_X(x)$ in the one-dimensional privacy-preserving situation of Agrawal and
Srikant [3] in which a simple type of additive noise is used for masking. In more general
situations, the densities could be associated with multivariate data that has been
masked via other methods. Then the information loss in the masked data is given by

$$I(f_X, \hat{f}_X) = \frac{1}{2}\, E\left[ \int \left| f_X(x) - \hat{f}_X(x) \right| dx \right].$$

This metric equals half the expected value of the $L_1$-norm between the original
distribution $f_X(x)$ and the estimate $\hat{f}_X(x)$. The information loss $I(f_X, \hat{f}_X)$ takes
values between 0 and 1. If $I(f_X, \hat{f}_X) = 0$, then the estimate $\hat{f}_X(x)$ perfectly
reconstructs $f_X(x)$. An issue with reconstructing the density $f_X(x)$ is the amount of data
needed and the accuracy of the estimate $\hat{f}_X(x)$. It may be more difficult to obtain
the estimate $\hat{f}_X(x)$ than to do a direct comparison, say, of two regression analyses
using the masked and original data. This concern applies to all of the information-loss
situations involving synthetic data (given below).
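
As an illustration of the half-$L_1$ distance, the sketch below evaluates it for two densities tabulated on a common grid of equal-width bins. It is only a sketch under stated assumptions: it handles a single estimated density (so the expectation over repeated maskings is not taken), and the function name, the grid, and the density values are hypothetical.

```python
def information_loss(f_original, f_estimated, bin_width):
    """Half the L1 distance between two densities tabulated on the same
    grid of equal-width bins (a Riemann-sum approximation of the integral)."""
    if len(f_original) != len(f_estimated):
        raise ValueError("densities must be tabulated on the same grid")
    l1 = sum(abs(p - q) for p, q in zip(f_original, f_estimated)) * bin_width
    return 0.5 * l1

# Hypothetical densities on four bins of width 0.25 covering [0, 1].
f_x = [2.0, 1.0, 0.5, 0.5]     # original density, integrates to 1
f_hat = [1.0, 1.0, 1.0, 1.0]   # uniform estimate, integrates to 1
loss = information_loss(f_x, f_hat, 0.25)
```

Because both inputs integrate to one, the value lies between 0 (identical densities) and 1 (densities with disjoint support).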

Gomatam and Karr [29] provide a review and empirical comparison of six
information-loss metrics for discrete data from the statistical and other literature. In the
following, $f$ and $g$ are two discrete distributions where the first might be associated
with the unmasked data X and the second with the masked data Y. They use Hellinger
distance, whose usual definition is

$$H(f, g) = \left( \frac{1}{2} \sum_i \left( \sqrt{f_i} - \sqrt{g_i} \right)^2 \right)^{1/2}.$$

They use the discrete analog of the distance metric of Agrawal and Aggarwal [2]
that they refer to as total variation. They use change in entropy that has been used by
Willenborg and De Waal [66] and Domingo-Ferrer and Mateo-Sanz [18]. They use
Cramer’s V measure of association based on the $\chi^2$ statistic for an $m \times n$ contingency
table that is defined as

$$V = \left( \frac{\chi^2}{N \min(m - 1, n - 1)} \right)^{1/2},$$

where $\chi^2$ is the usual statistic for the test of independence. They also use Pearson’s
contingency coefficient C that is defined as

$$C = \left( \frac{\chi^2}{\chi^2 + N} \right)^{1/2}.$$

In the empirical work, Gomatam and Karr [29] apply the information loss metrics
to data in which swapping has been performed (Dalenius and Reiss [9]). Willenborg
and De Waal [66] define a data swap of 2k elements in terms of k elementary swaps.
In an elementary swap we first make a random selection of two records i and j from
the file and then interchange the values of the variables being swapped for these two
records. The swap proportion or rate is defined as 2k/N where N is the number of
records. Gomatam and Karr observe that each of the metrics generally increases with
the increase in the swap rate and that they are quite correlated.
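
The swapping procedure and one of the association measures can be sketched together. The code below is an illustrative sketch, not the procedure of Gomatam and Karr [29]: it applies k elementary swaps to a single variable and computes Cramer's V before and after. The function names, the toy file, and the seed are hypothetical; the sketch assumes each variable has at least two levels and that no row or column of the table is empty.

```python
import math
import random
from collections import Counter

def swap_records(records, variable, k, seed=None):
    """Apply k elementary swaps of `variable`; return (masked records, rate 2k/N)."""
    rng = random.Random(seed)
    masked = [dict(r) for r in records]            # copy, so the input is untouched
    for _ in range(k):
        i, j = rng.sample(range(len(masked)), 2)   # two distinct records
        masked[i][variable], masked[j][variable] = masked[j][variable], masked[i][variable]
    return masked, 2 * k / len(masked)

def cramers_v(records, row_var, col_var):
    """Cramer's V for the two-way table of row_var by col_var."""
    cells = Counter((r[row_var], r[col_var]) for r in records)
    rows = Counter(r[row_var] for r in records)
    cols = Counter(r[col_var] for r in records)
    total = len(records)
    chi2 = 0.0
    for x in rows:
        for y in cols:
            expected = rows[x] * cols[y] / total
            chi2 += (cells[(x, y)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (total * min(len(rows) - 1, len(cols) - 1)))

# Hypothetical file of 80 records with a clear association between sex and job.
records = ([{"sex": "F", "job": "A"}] * 30 + [{"sex": "F", "job": "B"}] * 10
           + [{"sex": "M", "job": "A"}] * 10 + [{"sex": "M", "job": "B"}] * 30)

masked, rate = swap_records(records, "job", k=20, seed=0)
v_before = cramers_v(records, "sex", "job")
v_after = cramers_v(masked, "sex", "job")
```

Swapping leaves the marginal distribution of the swapped variable unchanged, so any difference between `v_before` and `v_after` reflects a change in the association only. A record may be drawn in more than one elementary swap, so fewer than 2k records may actually change.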

Within the multiple-imputation framework of Raghunathan et al. [48], Little and
Liu [37], [38] have also considered an information loss metric for univariate statistics.
Before giving the statistic, we need to define some terms. We are interested in the
scalar parameter $\phi$. For any completed data set $d$ ($d = 1, \ldots, D$) among $D$ copies
of the data obtained from randomly drawing from the model, let $\hat{\phi}_d$ denote an
estimate of $\phi$ and $W_d$ an estimate of the variance of $\hat{\phi}_d$. The MI estimate of $\phi$ is
given by $\bar{\phi} = \sum_{d=1}^{D} \hat{\phi}_d / D$, with variance

$$T = W + B/D,$$

where $W = \sum_{d=1}^{D} W_d / D$ and $B = \sum_{d=1}^{D} (\hat{\phi}_d - \bar{\phi})^2 / (D - 1)$. Then the information
loss associated with the scalar $\phi$ is given by $\gamma = (B/D)/T$. Since B and T are
likely to be bounded, the information loss $\gamma$ goes to zero as the number of replicates
D increases. The metric $\gamma$ is appealing. If the analysis is based on a highly accurate
model M, then the model M provides estimates of scalars for which information loss
can go to zero. As MI is a general framework, it would be useful to extend the
information loss to multivariate situations and provide a number of examples.

Because of the difficulties in creating detailed models M, Little and Liu [38] and 
Reiter [51] only develop partial models that are applied to a subset of the variables in 
a data file. Alternatively, Kennickell [31] creates a MI model for data and iteratively 
blanks and fills values of variables to create a file of synthetic data. The iterative 
cycling between blanking data and filling in data stops when the data are believed to 
have converged or sufficient cycles have taken place. The ideas of Kennickell have 
been adapted and extended by Abowd and Woodcock [1]. 

The research questions for information loss are particularly difficult. Is it possible
to create models that represent data in a form that allows or accounts for several
analyses? With different models, is it possible to have metrics for information loss that
relate to several analyses? How does a statistical agency create public-use files that
preserve analytic properties and provide statistical justifications of the limitations of
the public-use data?



6 Concluding Remarks

This paper provides an overview of a number of methods that are in common use for 
the production of public-use microdata. It covers analytic uses of the microdata and
some of the research problems in information-loss metrics. It also covers methods of 
evaluating re-identification risk. 

Disclaimer: This report is released to inform interested parties of research and to
encourage discussion. The views expressed are those of the authors and not necessarily
those of the U. S. Census Bureau. The author thanks Dr. Nancy Gordon, Dr. Cynthia
Clark, and two reviewers for comments leading to improved wording and explanation
and several additional references.



Abstract. When microdata files for research are released, it is possible
that external users may attempt to breach confidentiality. For this reason 
most National Statistical Institutes apply some form of disclosure risk 
assessment and data protection. Risk assessment first requires a measure 
of disclosure risk to be defined. In this paper we build on previous work 
by [BF98] to define a Bayesian hierarchical model for risk estimation. 
We follow a superpopulation approach similar to [BKP90] and [Rin03].
For each combination of values of the key variables we derive the poste- 
rior distribution of the population frequency given the observed sample 
frequency. Knowledge of this posterior distribution enables us to obtain 
suitable summaries that can be used to estimate the risk of disclosure. 
One such summary is the mean of the reciprocal of the population fre- 
quency or Benedetti-Franconi risk, but we also investigate others such as 
the mode. We apply our approach to an artificial sample of the Italian 
1991 Census data, drawn by means of a widely used sampling scheme. 
We report on results of this application and document the computa- 
tional difficulties that we encountered. The risk estimates that we obtain 
are sensible, but suggest possible improvements and modifications to 
our methodology. We discuss these together with potential alternative 
strategies. 

Keywords: Bayesian hierarchical model, disclosure risk, empirical 
Bayes, Hypergeometric function, individual risk, Mathematica, risk as- 
sessment, social survey data, summarising posterior distributions 



1 Introduction 

When microdata files for research are released, it is possible that external users 
may attempt to breach confidentiality. For this reason most National Statistical 
Institutes apply some form of disclosure risk assessment and data protection. 
Risk assessment first requires a measure of disclosure risk to be defined; as this
is usually cast in terms of population quantities, risk estimation is then achieved
by introducing suitable statistical models. If the estimated risk is considered not 
tolerable, protection measures must be put into practice. 

In this paper we base our definition of disclosure on the concept of
re-identification; see [WdW01]. Therefore by disclosure we mean a correct record
re-identification operation that is achieved by an intruder when comparing a tar- 
get individual in a sample with an available list of units that contains individual 
identifiers such as name and address. 

Even when attention is focused only on re-identification disclosure, different 
approaches to risk assessment can be pursued. For instance, global risk measures 
can be defined that allow us to screen out unsafe data releases; see, for example, 
[FM98], [DL89], [BKP90], [Lam93], [SE02], and [Car02]. Alternatively, individual
or combination-level risk measures, as defined in [BF98], [SH98], [Car02], and
[ES04] among others, can be exploited to identify and protect unsafe records
before the microdata file is released. A routine for computing a measure of individual
risk of disclosure is now implemented in the software μ-Argus, developed
under the European Union project CASC on Computational Aspects of Statistical
Confidentiality. For a comprehensive approach that integrates both individual
and global measures, see [Pol03].

In social surveys, the observed variables are frequently categorical in nature, 
and often comprise publicly available variables, such as sex, age, region of resi- 
dence. Variables such as these that may allow identification and are accessible 
to the public are referred to as key variables. In such a framework, risk is usually 
defined as a function of combinations of values of key variables. These correspond 
to a contingency table built by cross-tabulating the key variables. Records pre- 
senting combinations of key variables that are unusual or rare in the population 
clearly have a high disclosure risk, whereas rare or even unique combinations in 
the sample do not necessarily correspond to high risk individuals. 

[BF98] introduced a Bayesian framework to estimate a record-level measure
of re-identification risk. They noticed that $1/F_k$ is the probability of
re-identification of individual $i$ in cell $k$, $k = 1, \ldots, K$, when $F_k$ individuals in the
population are known to belong to this cell. The number $K$ of combinations in the
population is assumed known. In order to infer the population frequency $F_k$ of a
given combination from its sample frequency $f_k$, they then focused on the posterior
distribution of $F_k$ given $f_k$; see also [FM98]. Finally, the Benedetti-Franconi
risk is defined as the expected value of $1/F_k$ under this distribution. This proposal
aroused a large debate that resulted in a series of papers by [DCFS03],
[Pol03] and [Rin03]. In this paper we build on previous work to define a Bayesian
hierarchical model for risk estimation. We derive the posterior distribution of the
population frequency for each combination of values of the key variables given
the observed sample frequencies. Knowledge of this distribution enables us to
obtain suitable summaries that can be used to estimate the risk of disclosure;
one such summary is $E[1/F_k \mid f_k]$, but different summaries of the distribution,
such as the mode or the median, are considered. The methodology adopted in
the paper follows a superpopulation approach similar to that used in [BKP90],
where a Poisson-gamma model is first proposed; [SH98] suggest instead using
a Poisson-lognormal model. A different, yet related procedure is described in
[Car02] and [ES04].



1.1 Some Additional Notation 

Let the microdata file be a random sample of size n drawn from a finite pop- 
ulation of N units, where N is assumed known. For a generic unit i in the 
population, we denote as the probability that i is included in the sample. 
For each record i, the released file contains a set of key variables and other 
variables. Let 

$$\pi_k = P(\text{a member of the population falls into cell } k) > 0$$

and

$$p_k = P(\text{a member of population cell } k \text{ falls into the sample}) > 0.$$

As we want to infer the population frequencies from the sample frequencies, a
distribution of interest is the posterior distribution of $F_k$ given $f_k$, the probability
mass function of which we shall write as $[F_k \mid f_k]$. Another useful distribution
will be the marginal distribution of the observed data $f_k$, the probability mass
function of which we shall write as $[f_k]$.

1.2 Outline of the Paper 

In Sect. 2 we define the Bayesian hierarchical model upon which our methodology 
is based. We give expressions for the probability mass functions [Fk|fk] and [fk]
in Sect. 3, and present the log-likelihood function associated with [fk]. Since
both [Fk|fk] and [fk] involve the Hypergeometric function, we also discuss some
issues concerning the computation of that function. In Sect. 4 we describe the 
application of our methodology to data and discuss the results that we obtain. 
In Sect. 5 we present our conclusions and further discussion. Technical results 
are collected together in the Appendix. 



2 Defining the Model 



Our approach to risk estimation is based on a Bayesian hierarchical model. We
assume that the $\pi_k$s are drawn independently from a gamma($\alpha$, $\lambda$) distribution,
in which $\alpha$ and $\lambda$ are unknown positive parameters. This means that the
probability density function of $\pi_k$ is

$$[\pi_k] = \lambda^{\alpha} \pi_k^{\alpha - 1} e^{-\lambda \pi_k} / \Gamma(\alpha),$$

in which $\Gamma(\alpha)$ is the gamma function $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx$. We impose the
constraint that $E\left[\sum_{k=1}^{K} \pi_k\right] = 1$, with the result that $\lambda = K\alpha$.

We assume that $F_k$ is drawn from a Poisson($N\pi_k$) distribution. This means
that the probability mass function of $F_k$ given $\pi_k$ is

$$[F_k \mid \pi_k] = (N\pi_k)^{F_k} e^{-N\pi_k} / F_k!.$$

Next, we assume that the $p_k$s are drawn independently from a beta($a_k$, $b_k$)
distribution, where $a_k$ and $b_k$ are unknown positive, possibly cell-specific parameters.
This means that the probability density function of $p_k \in (0, 1)$ is

$$[p_k] = p_k^{a_k - 1} (1 - p_k)^{b_k - 1} / B(a_k, b_k),$$

in which $B(a_k, b_k)$ is the beta function

$$B(a_k, b_k) = \int_0^1 t^{a_k - 1} (1 - t)^{b_k - 1}\,dt = \Gamma(a_k)\Gamma(b_k) / \Gamma(a_k + b_k).$$

Finally, we assume that $f_k$ is drawn from a binomial($F_k$, $p_k$) distribution.
This means that the probability mass function of $f_k$ given $F_k$, $\pi_k$ and $p_k$
is

$$[f_k \mid F_k, \pi_k, p_k] = \binom{F_k}{f_k} p_k^{f_k} (1 - p_k)^{F_k - f_k},$$

which does not depend on $\pi_k$.

Overall, our model takes the following form:

$\pi_k \sim \text{gamma}(\alpha, K\alpha)$, $\pi_k > 0$, $k = 1, \ldots, K$, independently,

$F_k \mid \pi_k \sim \text{Poisson}(N\pi_k)$, $F_k = 0, 1, \ldots$, independently,

$p_k \sim \text{beta}(a_k, b_k)$, $0 < p_k < 1$, independently,

$f_k \mid F_k, \pi_k, p_k \sim \text{binomial}(F_k, p_k)$, $f_k = 0, 1, \ldots, F_k$, independently.

We note that if $\alpha \to 0$, the density $[\pi_k]$ tends to one with $[\pi_k] \propto 1/\pi_k$. This is the
distribution for $\pi_k$ implicitly assumed by [BF98]; see [Pol03] and [Rin03] for more details.
In Sect. 5 we propose a model that relaxes the assumption of independence.
Our strategy involves using $[f_k]$ to obtain maximum likelihood estimates of the
unknown parameters and then estimating the risk of disclosure for cell $k$ through
$[F_k \mid f_k]$. In Sect. 3 we present formulas for these probability mass functions.
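
To make the hierarchy concrete, the following sketch simulates one pair of population and sample frequencies per cell by drawing from the four levels in turn. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a common pair (a, b) is used for all cells, the parameter values are arbitrary, and the simple Poisson sampler is only adequate for the modest means that arise with a small N.

```python
import math
import random

def poisson(rng, mean):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def simulate(K, N, alpha, a, b, seed=None):
    """Draw (F_k, f_k) for K cells from the gamma / Poisson / beta / binomial hierarchy."""
    rng = random.Random(seed)
    cells = []
    for _ in range(K):
        # gammavariate takes shape and SCALE, so rate K*alpha becomes scale 1/(K*alpha).
        pi_k = rng.gammavariate(alpha, 1.0 / (K * alpha))
        F_k = poisson(rng, N * pi_k)                        # population frequency
        p_k = rng.betavariate(a, b)                         # cell inclusion probability
        f_k = sum(rng.random() < p_k for _ in range(F_k))   # binomial(F_k, p_k) draw
        cells.append((F_k, f_k))
    return cells

# Hypothetical settings: 50 cells, population size 1000, mean sampling fraction 0.1.
cells = simulate(K=50, N=1000, alpha=1.0, a=1.0, b=9.0, seed=0)
```

Since the expected value of each $\pi_k$ is 1/K, the simulated population frequencies sum to roughly N on average, which is the role of the constraint $\lambda = K\alpha$.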



3 Two Useful Distributions, an Associated Log-Likelihood 
Function and Computational Issues 

In Sect. 3.1 we give expressions for [Fk|fk] and [fk]. The associated log-likelihood
function that we will need for our maximum likelihood approach is given in 
Sect. 3.2. In Sect. 3.3 we briefly discuss computational issues. 
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3.1 Two Useful Distributions 

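The expressions for $[F_k \mid f_k]$ and $[f_k]$ involve the ${}_2F_1(A, B; C; z)$ Hypergeometric
function which in its integral representation is defined as:

$${}_2F_1(A, B; C; z) = \frac{\Gamma(C)}{\Gamma(B)\Gamma(C - B)} \int_0^1 t^{B - 1} (1 - t)^{C - B - 1} (1 - tz)^{-A}\,dt, \qquad (1)$$

for $\Re(C) > \Re(B) > 0$; see [AS65].

Distribution 1. The probability mass function of $F_k$ given $f_k$ is

$$[F_k \mid f_k] = \frac{(K\alpha)^{\alpha + f_k}\,\Gamma(a_k + b_k + f_k)}{\Gamma(b_k)\,\Gamma(\alpha + f_k)\;{}_2F_1\!\left(\alpha + f_k, a_k + f_k; a_k + b_k + f_k; -\frac{N}{K\alpha}\right)}
\times \frac{N^{F_k - f_k}\,\Gamma(\alpha + F_k)\,\Gamma(F_k - f_k + b_k)}{(N + K\alpha)^{\alpha + F_k}\,\Gamma(a_k + b_k + F_k)\,\Gamma(F_k - f_k + 1)},$$

$$F_k = f_k, f_k + 1, \ldots.$$

Distribution 2. The probability mass function of $f_k$ is

$$[f_k] = \frac{\Gamma(b_k)}{\Gamma(\alpha)\,B(a_k, b_k)} \left(\frac{N}{K\alpha}\right)^{f_k} \frac{\Gamma(\alpha + f_k)\,\Gamma(a_k + f_k)}{\Gamma(f_k + 1)\,\Gamma(a_k + b_k + f_k)}\;
{}_2F_1\!\left(\alpha + f_k, a_k + f_k; a_k + b_k + f_k; -\frac{N}{K\alpha}\right),$$

$$f_k = 0, 1, \ldots.$$

These results can be obtained from the negative binomial distributions $[F_k \mid p_k, f_k]$
and $[f_k \mid p_k]$ as stated in [Rin03], for example, by multiplying by $[p_k \mid f_k]$ and $[p_k]$
respectively and integrating. Full details of these derivations can be obtained
from the authors.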
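
As a numerical cross-check that avoids the Hypergeometric function altogether, both mass functions can be approximated by summing the joint mass function of the population and sample frequencies over a truncated range of population frequencies. The sketch below assumes the model of Sect. 2 with common (a, b): the population frequency is then marginally negative binomial (Poisson mixed over the gamma distribution) and the sample frequency given the population frequency is beta-binomial. The function names, the truncation point `F_max`, and the parameter values are illustrative assumptions, not taken from the paper; truncation is only adequate when N/(N + Kα) is not too close to 1.

```python
import math

def log_joint(F, f, alpha, a, b, N, K):
    """log of the joint mass of (F_k, f_k): negative binomial times beta-binomial."""
    lam = K * alpha
    log_negbin = (math.lgamma(alpha + F) - math.lgamma(alpha) - math.lgamma(F + 1)
                  + alpha * math.log(lam / (N + lam)) + F * math.log(N / (N + lam)))
    log_betabin = (math.lgamma(F + 1) - math.lgamma(f + 1) - math.lgamma(F - f + 1)
                   + math.lgamma(a + f) + math.lgamma(b + F - f) - math.lgamma(a + b + F)
                   + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    return log_negbin + log_betabin

def posterior(f, alpha, a, b, N, K, F_max=2000):
    """Return ({F: P(F_k = F | f_k = f)}, P(f_k = f)), truncating the sum at F_max."""
    weights = {F: math.exp(log_joint(F, f, alpha, a, b, N, K))
               for F in range(f, F_max + 1)}
    marginal = sum(weights.values())
    return {F: w / marginal for F, w in weights.items()}, marginal

def bf_risk(post):
    """Posterior mean of 1/F_k; cells with f_k >= 1 always have F_k >= 1."""
    return sum(p / F for F, p in post.items() if F > 0)

# Hypothetical small example: 10 cells, population of 50, a sample unique (f_k = 1).
post, marginal = posterior(1, alpha=1.0, a=1.0, b=9.0, N=50, K=10, F_max=300)
risk = bf_risk(post)
```

With a realistic population size the ratio N/(N + Kα) is very close to 1 and far more terms are needed, which is one way to see why a closed form in terms of the Hypergeometric function is attractive.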



3.2 The Log-likelihood Function

We can now obtain the log-likelihood function of $\alpha, a_1, \ldots, a_K$ and $b_1, \ldots, b_K$
given data $f_1, \ldots, f_K$ as $\sum_{k=1}^{K} \log [f_k]$. Up to an additive constant this can be
written as

$$L(\alpha, a_1, \ldots, a_K, b_1, \ldots, b_K) = -K \log \Gamma(\alpha)
+ \sum_{k=1}^{K} \Big\{ \log \Gamma(a_k + b_k) - \log \Gamma(a_k) + \log \Gamma(\alpha + f_k)
+ \log \Gamma(a_k + f_k) - \log \Gamma(a_k + b_k + f_k) - f_k \log \alpha
+ \log {}_2F_1\!\left(\alpha + f_k, a_k + f_k; a_k + b_k + f_k; -\frac{N}{K\alpha}\right) \Big\}. \qquad (2)$$

If we make the simplifying assumption that $a_1 = \cdots = a_K = a$ and
$b_1 = \cdots = b_K = b$, the log-likelihood function (2) becomes

$$L(\alpha, a, b) = -K \{\log \Gamma(\alpha) - \log \Gamma(a + b) + \log \Gamma(a)\}
+ \sum_{k=1}^{K} \Big\{ \log \Gamma(\alpha + f_k) + \log \Gamma(a + f_k) - \log \Gamma(a + b + f_k) - f_k \log \alpha
+ \log {}_2F_1\!\left(\alpha + f_k, a + f_k; a + b + f_k; -\frac{N}{K\alpha}\right) \Big\} \qquad (3)$$

up to an additive constant.



3.3 Computational Issues 

The evaluation of the Hypergeometric function ${}_2F_1$ presented us with considerable
difficulties. We experimented with a variety of software such as Mathematica,
Maple and FORTRAN. We found that Mathematica produced the most reliable and
consistent evaluations of ${}_2F_1$. For speed of evaluation we developed a Laplace
approximation to ${}_2F_1$. This is discussed in detail in Appendix 5. Whenever possible
the results that we report here were obtained using the highest available accuracy,
as offered by Mathematica. Often the results that we obtained employing
the approximation were very similar to those obtained with Mathematica.
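
One elementary way to evaluate the Hypergeometric function at a large negative argument is to move the argument into the unit interval with the Pfaff transformation and then sum the Gauss series. The sketch below illustrates that idea only; it is not the Laplace approximation developed by the authors nor the algorithm used by Mathematica, and the function names and tolerances are assumptions. For arguments as extreme as those in the application, the transformed argument is very close to 1, convergence is slow, and the power $(1 - z)^{-A}$ would need to be handled on the log scale.

```python
import math

def hyp2f1_series(A, B, C, w, tol=1e-15, max_terms=1_000_000):
    """Gauss series for 2F1(A, B; C; w), valid for |w| < 1."""
    term, total = 1.0, 1.0
    for n in range(max_terms):
        # Ratio of consecutive series terms: (A+n)(B+n) / ((C+n)(n+1)) * w.
        term *= (A + n) * (B + n) / ((C + n) * (n + 1)) * w
        total += term
        if abs(term) <= tol * abs(total):
            return total
    raise ArithmeticError("series did not converge")

def hyp2f1_negative(A, B, C, z):
    """2F1(A, B; C; z) for z <= 0 via the Pfaff transformation
    2F1(A, B; C; z) = (1 - z)^(-A) * 2F1(A, C - B; C; z / (z - 1))."""
    return (1.0 - z) ** (-A) * hyp2f1_series(A, C - B, C, z / (z - 1.0))

# Check against the closed form 2F1(1, 1; 2; z) = -log(1 - z) / z at z = -3.
value = hyp2f1_negative(1.0, 1.0, 2.0, -3.0)
reference = math.log(4.0) / 3.0
```

The transformation maps $z = -N/(K\alpha)$ to $N/(N + K\alpha)$, so all series terms are positive when the parameters are positive and no cancellation occurs.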

4 Applications to Data 

We describe the results obtained when we applied the proposed methodology to 
an artificial sample of data drawn from the Italian 1991 Census. In Sect. 4.1 we 
describe the data and sampling design, while in Sect. 4.2 we discuss our results 
and the problems we faced when applying the methodology. We also suggest 
some possible improvements. 

4.1 Description of the Available Data 

We work with an artificial sample of n = 53,872 records drawn from the 1991
Italian Census data according to the sampling scheme of the Labour Force Sur- 
vey, as described in [DCFS03]. 

The data come from four administrative Italian regions, namely Campa- 
nia, Lazio, Val d’Aosta and the Veneto. The total number of individuals in the 
population from these four regions is N = 15,142,320. Among the many variables
collected in the Census, we chose the following as key variables: sex (2
categories), age (recoded in 14 classes), region of residence (the 4 regions just 
mentioned), position in profession (14 categories) and relationship with the head 
of the household (13 categories). Since this is an instance where the population 
cell frequencies Fk are known, the data allow us to assess the proposed procedure 
by comparing known population quantities with their corresponding estimates. 
Moreover, we can exploit the sampling design information and estimate the
population cell frequencies using sampling design weights as $\hat{F}_k^D = \sum_{i \in C_k} w_i$, in
which $w_i$ is the sampling weight of record $i$ and $C_k$ is the set of records in the
sample that belong to class k. We remark
here that the sampling weights are calibrated to match known population totals 
on a set of auxiliary variables, that need not be the same as the key variables. 
For details, see [DS92] and [DCFS03]. 
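
A design-based estimate of this kind is commonly computed by summing the sampling weights of the sample records that fall in each cell. The sketch below assumes that form and a hypothetical layout in which each sample record is a pair of a cell label and a weight; the function name, the labels, and the weights are made up for illustration.

```python
from collections import defaultdict

def design_based_frequencies(sample):
    """Sum the sampling weights of the sample records falling in each cell."""
    totals = defaultdict(float)
    for cell, weight in sample:
        totals[cell] += weight
    return dict(totals)

# Hypothetical records: (combination of key variables, calibrated weight).
sample = [
    (("F", "30-39", "Lazio"), 250.0),
    (("F", "30-39", "Lazio"), 310.0),
    (("M", "40-49", "Veneto"), 275.5),
]
estimates = design_based_frequencies(sample)
```

Cells that do not appear in the sample receive no estimate under this scheme, and cells observed once inherit the full variability of a single weight.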

4.2 Results 

We began by attempting to estimate the parameters $\alpha$, $a$ and $b$ by maximising
the log-likelihood (3). Numerical evaluation of the profile log-likelihood of $b$
for fixed $(\alpha, a)$ indicated that this function is increasing with $b$. Moreover, we
found that for high values of $b$, the appropriate Hypergeometric function (2)
cannot be accurately evaluated. We also noticed that for fixed $b$ the likelihood
shows a regular behaviour, exhibiting a clear maximum. For the reasons just
mentioned, we decided to impose an upper bound for $b$, to set $b$ to this upper
bound, and finally to estimate $(\alpha, a)$ by maximising the profile log-likelihood
$L(\alpha, a) = L(\alpha, a, b)$. Numerical estimates were obtained using the FindMinimum
routine provided by Mathematica. For increasing values of 6, the maximum like- 
lihood estimates of {a, a) slightly increase, with a being affected the least. 

The probability mass functions [fk] and [Fk\fk] can now be estimated by 
plugging-in the maximum likelihood estimates of the parameters. The results 
that are based on [Fk|fk] were obtained using the approximation procedure
described in Appendix 5.

We can assess the maximum likelihood estimates in two ways. We may perform
goodness of fit analysis on the observed data by comparing the fitted probability
mass function $[f_k]$ with the histogram of the sample cell frequencies.
Alternatively, we may compare the expected values of $F_k$ given $f_k$ under the
estimated model $[F_k \mid f_k; \hat{\alpha}, \hat{a}, b]$ with the corresponding population frequencies $F_k$,
or with their estimated sampling design based counterparts $\hat{F}_k^D$. A comparison of
the performance of the maximum likelihood estimates $(\hat{\alpha}, \hat{a})$ in terms of each of
the above criteria has shown that both fit and prediction improve with increasing
values of $b$, a result that is in accordance with our observation that the likelihood
is an increasing function of $b$. Due to the above mentioned computational
problem connected with the Hypergeometric function, the highest value that we
were able to select for $b$ was $b = 80$. The corresponding profile log-likelihood
function is plotted in Fig. 1. The maximum of the function obtained by numerical
maximisation is located at the point $(\alpha, a) = (0.92, 0.86)$, indicated by a
dot in the figure. The slight kink in one of the contours is connected with the
numerical difficulties that Mathematica sometimes experiences when evaluating
the Hypergeometric function.

Fig. 1. Contour plot of $L(\alpha, a, 80)$ given in (3). We show $\alpha$ on the x-axis and $a$ on the
y-axis. A maximum is visible at around $(\alpha, a) = (0.92, 0.86)$, indicated by a dot

As a first assessment of the procedure that we adopted to select $b$, we show
in Fig. 2 the estimated values of $F_k$ obtained by the method we propose. In the
first row, these estimated values are compared with the sampling design based
estimates $\hat{F}_k^D$ using $b = 80$ (left panel) and $b = 0.5$ (right panel). The graphs show
that the choice $b = 80$ is to be preferred, and that our method produces estimates
that have approximately the same behaviour as those based on the sampling
design. Our estimates, however, show less variation than $\hat{F}_k^D$, the variation in
which is probably induced by the sampling weights which change considerably
across regions. The bottom row of Fig. 2 shows our estimates plotted against $F_k$.
We notice that our method shows a tendency towards underestimation of high
cell frequencies and overestimation of small cell frequencies. Different graphs
could be obtained using different summaries of the posterior distribution $[F_k \mid f_k]$,
such as the median or the mode. The method was, however, not developed to
obtain estimates of the population cell frequencies, and so we do not show these
results. Our main aim is to estimate the disclosure risk, the Benedetti-Franconi
definition of which is not the only one possible. Accordingly, we have investigated
other summaries of the posterior distribution $[1/F_k \mid f_k]$, such as the mode and
the median, as these can provide more appropriate estimates of risk, especially
when $[1/F_k \mid f_k]$ is skewed. In fact, the above posterior mode and median can be
obtained as the reciprocal of the mode and median of $[F_k \mid f_k]$. In experiments we
found that the mode provided the best results for the disclosure risk.
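
The three risk summaries can be read off a tabulated posterior for the population frequency. The sketch below assumes the posterior is available as a dictionary mapping each population frequency to its probability (how it is obtained is left open), and it follows the reciprocal relationship stated above; the function name and the toy posterior are hypothetical. For a discrete distribution the median depends on the tie convention, and the smallest value whose cumulative probability reaches 0.5 is used here.

```python
def risk_summaries(posterior):
    """posterior: dict {F: probability} over population frequencies F >= 1.
    Returns the posterior mean of 1/F and the reciprocals of the mode and median of F."""
    support = sorted(posterior)
    mean_risk = sum(posterior[F] / F for F in support)
    mode_F = max(support, key=lambda F: posterior[F])   # ties go to the smallest F
    cumulative, median_F = 0.0, support[-1]
    for F in support:
        cumulative += posterior[F]
        if cumulative >= 0.5:
            median_F = F
            break
    return {"mean": mean_risk, "mode": 1.0 / mode_F, "median": 1.0 / median_F}

# Hypothetical posterior for a sample unique.
summaries = risk_summaries({1: 0.2, 2: 0.5, 4: 0.3})
```

In this toy case the mean-based risk (0.525) exceeds the mode-based and median-based risks (both 0.5) because the mass at a population frequency of 1 pulls the mean of the reciprocal upwards.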

Fig. 3 analyzes the performance of the risk estimates obtained from our
procedure by comparing these with (a) $1/F_k$ and (b) the estimated Benedetti-Franconi
risk $\hat{r}_k$. There is a strong relationship between these quantities. As
demonstrated in [DCFS03] and [Rin03], the Benedetti-Franconi risk can provide
an overestimate of the true risk, especially for small sample cell frequencies which
correspond to not very small cells in the population. These can be clearly seen
in Fig. 4. In these cases, our method provides a lesser extent of overestimation.
Since we return a single risk value for any sample unique, our method underestimates
the high risk cells more than the Benedetti-Franconi method. On average,
however, our estimates for the sample unique cells are in good agreement with
the average true risk.

Fig. 2. Scatter plot of the estimated $F_k$ obtained from the proposed method against
the estimates $\hat{F}_k^D$ (top row) and against $F_k$ (bottom row). The left column (dots)
corresponds to estimates obtained using $b = 80$, while the right column (triangles)
corresponds to $b = 0.5$. Logarithmic scales are used for all axes

The model that we estimate using the log-likelihood (3) does not allow $(a, b)$
to vary across combinations. Therefore, even though variation across cells is allowed
through the parameter $p_k$, the estimated population cell frequencies only
depend on $f_k$. As all cells with the same sample frequency have the same estimated
risk, the method does not allow us to effectively discriminate between
low frequencies that are due to the sampling effect, and low frequencies that
correspond to characteristics that are rare in the population. To overcome this
drawback we suggest allowing the parameters $(a_k, b_k)$ of the beta distribution
discussed in Sect. 2 to vary across cells. This is achieved, for instance, in the approach
suggested by Benedetti and Franconi, by exploiting the sampling weights.
Alternative approaches that take into account the association between key variables
are mentioned in Sect. 5. We also notice that the scatter plots always
show two stripes. These are due to the sampling design, which is arranged to
provide estimates that have the same precision across regions by ensuring that
the sample size is approximately constant. Our data come from the very small
Val d’Aosta region, as well as the large Campania, Lazio and Veneto regions.
Hence Val d’Aosta has an allocation that is more than proportional to the size
of its population. Compared with the other regions, the binomial distributions
that we used to model the sample frequencies for Val d’Aosta should therefore
have higher probabilities $p_k$. Our estimates clearly provide an average of these
different patterns. A possible solution that we have begun to investigate is to
allow two different sets for $a$ and $b$, one for Val d’Aosta and the other for the
remaining regions. For given $F_k$, the beta distribution for $p_k$ would then accommodate
$f_k$s that for Val d’Aosta are higher than for the other regions. The
corresponding risk estimates should change accordingly, and the stripes should
disappear from the graph. Initial experiments have indicated that this will be a
successful strategy. We will report further details elsewhere.

Fig. 3. Scatter plot of the disclosure risk estimated using the proposed method as the
reciprocal of the mode of $[F_k \mid f_k]$ against (a) $1/F_k$ and (b) the estimated Benedetti-Franconi
risk $\hat{r}_k$. Grey triangles represent the Val d’Aosta region, the population of
which is considerably smaller than that of the other regions

5 Conclusion and Discussion 

In order for National Statistical Institutes to perform risk assessment and data
protection, they need to define and estimate measures of disclosure risk. In this
paper we investigated a Bayesian hierarchical model for risk estimation, and
applied this framework to sample data from the 1991 Italian Census. We worked
with the posterior distribution of $[1/F_k \mid f_k]$. We considered several summaries
of this distribution, namely the mean, median and mode, to estimate the risk of
disclosure; in experiments, the mode provided the best results. While indicating
that the proposed methodology is sensible, our results clearly indicate the
necessity to take account of the complex sampling scheme. In addition, the
assumption of independence between cell frequencies neglects the structure of the
contingency table formed by the key variables and possible associations between
them. We now suggest several possible adjustments and alternative strategies.

Fig. 4. Scatter plot of the estimated Benedetti-Franconi risk $\hat r_k$ for sample
uniques against $1/F_k$. The horizontal line and the superimposed triangles show the
risk estimated using the proposed methodology

[SH98] and [ES04] use log-linear models to make the risk depend on the
observed values of the key variables. A different approach that might better
suit our model-based methodology could be based on composite estimators as
used in small area estimation; see [GR94], for example. In our case a composite
estimator for $F_k$ could take the form
$\tilde F_k = \delta_k \hat F_k^{D} + (1 - \delta_k)\hat F_k$, where $\hat F_k$ is our
model-based estimate of $F_k$ and $\delta_k \in [0, 1]$ is a weight. Analogously, a
composite estimator for the risk could be defined as
$\tilde r_k = \delta_k/\hat F_k^{D} + (1 - \delta_k)\hat r_k$, where $\hat r_k$
is our risk estimate. In experiments we took $\delta_k$ to depend on $f_k$. Results not
reported here indicate that this strategy has several advantages: the estimates
show better agreement with the true risk; the risk differs across combinations
having the same sample frequency; and the effect of different sampling ratios
discussed in Sect. 4.2 is corrected.

In order to produce genuine cell-specific risks under the model that we propose,
we could alternatively exploit the sampling probabilities to estimate
cell-specific beta parameters $a_k$ and $b_k$. This strategy will be analysed elsewhere.

Our model-based approach can be thought of as an approximation, valid
for large N and K, of a more general multinomial model-based approach that
represents the structure of the contingency table formed by the key variables; see
[Car02]. In order to take better account of this structure and of the constraints of
this contingency table, we have formulated and begun to investigate a different
hierarchical model:

$$\begin{aligned}
\pi &\sim \mathrm{Dirichlet}(\alpha, \ldots, \alpha), & \pi_k &> 0 \text{ for all } k, & \sum_{k=1}^{K} \pi_k &= 1 , \\
F \mid \pi &\sim \mathrm{multinomial}(N; \pi), & F_k &= 0, 1, \ldots \text{ for all } k, & \sum_{k=1}^{K} F_k &= N , \\
f \mid F &\sim \mathrm{multinomial}\left(n; \frac{F_1}{N}, \ldots, \frac{F_K}{N}\right), & f_k &= 0, 1, \ldots \text{ for all } k, & \sum_{k=1}^{K} f_k &= n ,
\end{aligned}$$

where $\pi = (\pi_1, \ldots, \pi_K)$ and F and f are defined similarly. This multinomial
model-based approach captures particular aspects of the problem better than
the one discussed above: each of the probabilities $\pi_1, \ldots, \pi_K$ takes values in
[0, 1] and both the population and sample frequencies sum to the
known totals N and n respectively; no assumption of independence has been
introduced.

As it is now possible to obtain an expression for the posterior distribution of
F given f up to a proportionality constant, we propose simulating from $[F \mid f]$
and hence $[F_k \mid f]$ to obtain numerical estimates of risk. These risk estimates will
depend on the parameter $\alpha$. We shall report results from the application of this
multinomial model-based approach elsewhere.
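
To make the three-stage structure concrete, the following is a minimal forward-simulation sketch of the hierarchy (not the posterior simulation from [F | f] that the text proposes). It assumes that the third stage has cell probabilities F_k/N; the function names and the values of N, n, K and the Dirichlet parameter are hypothetical and chosen only for illustration.

```python
import random


def sample_dirichlet(alpha, K, rng):
    """Draw pi ~ Dirichlet(alpha, ..., alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_multinomial(size, probs, rng):
    """Draw cell counts from multinomial(size; probs)."""
    counts = [0] * len(probs)
    for cell in rng.choices(range(len(probs)), weights=probs, k=size):
        counts[cell] += 1
    return counts


def simulate_hierarchy(N, n, K, alpha, seed=0):
    """Forward-simulate pi, then F | pi, then f | F (illustrative only)."""
    rng = random.Random(seed)
    pi = sample_dirichlet(alpha, K, rng)
    F = sample_multinomial(N, pi, rng)                      # population counts
    f = sample_multinomial(n, [Fk / N for Fk in F], rng)    # sample counts
    return pi, F, f


pi, F, f = simulate_hierarchy(N=1000, n=100, K=20, alpha=1.0)
```

By construction the simulated probabilities sum to one and the population and sample counts sum to N and n, which are the constraints the model is meant to respect.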
Appendix

An Approximate and Fast Way
of Evaluating the Hypergeometric Function

The main difficulty with the evaluation of the Hypergeometric function ${}_2F_1$ given
in (1) is caused by the integral

$$\int_0^1 t^{B-1} (1-t)^{C-B-1} (1+wt)^{-A} \, dt , \qquad (4)$$

where $w = -z$. We now approximate (4) using Laplace's method.

Let $g(t) = t^{B-1}(1-t)^{C-B-1}(1+wt)^{-A}$ be the integrand of (4) and let
$h(t) = \log g(t) = (B-1)\log t + (C-B-1)\log(1-t) - A\log(1+wt)$. We shall
approximate $h$ by a quadratic function $\tilde h$ with maximum at the same point
$t^* \in (0, 1)$ as the maximum of $h$, if such a point exists.

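We now consider when $t^*$ does exist. For this we calculate $h'(t)$ and $h''(t)$ to
be:

$$h'(t) = \frac{B-1}{t} - \frac{C-B-1}{1-t} - \frac{Aw}{1+wt} ,$$

$$h''(t) = -\frac{B-1}{t^2} - \frac{C-B-1}{(1-t)^2} + \frac{Aw^2}{(1+wt)^2} .$$

If there is a $t^* \in (0, 1)$ such that $h'(t^*) = 0$, we have

$$\frac{Aw}{1+wt^*} = \frac{B-1}{t^*} - \frac{C-B-1}{1-t^*} ,$$

from which it follows after a little work that

$$h''(t^*) = -\frac{B-1}{t^*}\left\{\frac{1}{t^*(1+wt^*)}\right\}
           - \frac{C-B-1}{1-t^*}\left\{\frac{1+w}{(1-t^*)(1+wt^*)}\right\} .$$

Hence $h''(t^*) < 0$ if $B > 1$ and $C - B > 1$.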



Approximating h(t)

In order to approximate $h$ by a quadratic function $\tilde h$, we first solve $h'(t^*) = 0$.
If $t^* \in (0, 1)$, this is equivalent to solving the quadratic equation

$$(B-1) + \{(B-A-1)w + 2 - C\}\,t^* + (2+A-C)\,w\,t^{*2} = 0 \qquad (5)$$

for $t^*$. The solutions of (5) are

$$t^* = \frac{(1+A-B)w + C - 2 \pm \sqrt{D}}{2(2+A-C)w} ,$$

where

$$D = \{(1+A-B)w + C - 2\}^2 - 4(B-1)(2+A-C)w ,$$

provided $2 + A - C \neq 0$. We choose the solution $t^*$ such that $t^* \in (0, 1)$ and
$h''(t^*) < 0$, if it exists. If $2 + A - C = 0$, (5) becomes linear and

$$t^* = \frac{1-B}{(B-A-1)w + 2 - C} \in (0, 1) ,$$

for the particular values of A, B, C and w that appear in Sect. 3.

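We next approximate $h(t)$ by $\tilde h(t) = D + E(t - t^*)^2$, in which $D = h(t^*)$ and
$E = h''(t^*)/2 < 0$.

We can now approximate (4) as follows:

$$\int_0^1 g(t)\,dt = \int_0^1 \exp\{h(t)\}\,dt
  \approx \int_0^1 \exp\{\tilde h(t)\}\,dt
  = \int_0^1 \exp\{D + E(t-t^*)^2\}\,dt
  = e^{D} \int_{-t^*}^{1-t^*} \exp(E u^2)\,du ,$$

changing variables to $u = t - t^*$. Setting $\sigma^2 = -1/(2E) = -1/h''(t^*)$ we obtain

$$\int_0^1 g(t)\,dt \approx e^{D} \sqrt{2\pi}\,\sigma
  \left\{ \Phi\!\left(\frac{1-t^*}{\sigma}\right) - \Phi\!\left(-\frac{t^*}{\sigma}\right) \right\} ,$$

where $\Phi(z) = P(Z \le z)$, in which $Z \sim N(0, 1)$. This formula can be easily
implemented and quickly calculated in S-PLUS.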
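
The steps of this appendix can be sketched in a few lines of code. The following Python sketch (the source mentions S-PLUS; Python is used here only for illustration) locates $t^*$ from the quadratic (5), evaluates the curvature, and applies the Gaussian-tail formula. The function names and the parameter values A = 5, B = 20, C = 50, w = 1 are hypothetical; a midpoint-rule integral is included purely as a sanity check.

```python
import math


def log_integrand(t, A, B, C, w):
    """h(t) = log g(t), with g(t) = t^(B-1) (1-t)^(C-B-1) (1+w t)^(-A)."""
    return ((B - 1) * math.log(t) + (C - B - 1) * math.log(1 - t)
            - A * math.log(1 + w * t))


def second_derivative(t, A, B, C, w):
    """h''(t), the curvature of the log-integrand."""
    return (-(B - 1) / t ** 2 - (C - B - 1) / (1 - t) ** 2
            + A * w ** 2 / (1 + w * t) ** 2)


def stationary_point(A, B, C, w):
    """Root of the quadratic (5) lying in (0, 1) with h'' < 0."""
    qa = (2 + A - C) * w           # coefficient of t^2
    qb = (B - A - 1) * w + 2 - C   # coefficient of t
    qc = B - 1                     # constant term
    if qa == 0:
        if qb == 0:
            raise ValueError("equation (5) is degenerate")
        roots = [-qc / qb]         # (5) is linear in this case
    else:
        disc = qb * qb - 4 * qa * qc
        if disc < 0:
            raise ValueError("no real stationary point")
        roots = [(-qb + s * math.sqrt(disc)) / (2 * qa) for s in (1, -1)]
    for t in roots:
        if 0 < t < 1 and second_derivative(t, A, B, C, w) < 0:
            return t
    raise ValueError("no interior maximum; Laplace's method does not apply")


def laplace_integral(A, B, C, w):
    """Laplace approximation to the integral (4)."""
    t_star = stationary_point(A, B, C, w)
    sigma = math.sqrt(-1.0 / second_derivative(t_star, A, B, C, w))
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    mass = Phi((1 - t_star) / sigma) - Phi(-t_star / sigma)
    peak = math.exp(log_integrand(t_star, A, B, C, w))
    return peak * math.sqrt(2 * math.pi) * sigma * mass


def midpoint_integral(A, B, C, w, steps=20000):
    """Plain midpoint rule on (0, 1), used only to check the approximation."""
    width = 1.0 / steps
    return width * sum(math.exp(log_integrand((i + 0.5) * width, A, B, C, w))
                       for i in range(steps))


approx = laplace_integral(5, 20, 50, 1.0)
check = midpoint_integral(5, 20, 50, 1.0)
relative_error = abs(approx / check - 1.0)
```

The approximation is expected to be most accurate when B and C - B are large, so that the integrand is sharply peaked; for small exponents the quadratic fit to h can be poor.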

We have recently learnt that [BW00] have studied Laplace approximations for
Hypergeometric functions with matrix z as the fourth argument. When reduced
to a scalar z, their approach is slightly different from ours and may be somewhat
more accurate and numerically stable. We leave the detailed comparison of these
approximations for further investigation.
Abstract. Individual risk estimation was one of the issues that the
European Union project CASC targeted. On this subject ISTAT has built
on previous work by Benedetti and Franconi (1998) to improve individual
risk measures. These permit the identification of unsafe records to
be protected by disclosure limitation techniques. The software μ-Argus
now contains a routine, implemented by CBS Netherlands
in cooperation with ISTAT, for computing the Benedetti-Franconi or
individual risk of disclosure. The paper reviews the main aspects of the
individual risk methodology. Such an approach defines measures of risk
for protection of files of independent records as well as hierarchical files.
The theory and some practical issues such as threshold setting are
illustrated for both cases.

Keywords: Argus software, data protection, global risk, individual risk, 
household data, local suppression, re-identification disclosure, threshold 
setting, unsafe combinations 



1 Introduction 

One of the main objectives of the European Union project IST-2000-25069 CASC
on Computational Aspects of Statistical Confidentiality was to develop free
software for data protection. For files of microdata, μ-Argus (see [Hun02]) aims to
support users in charge of data release during the phase of data protection. The
software builds further on the version released under the Statistical Disclosure
Control (SDC) project within the European Union's 4th Framework Program.
Among the features of μ-Argus, the user can apply local suppression to selected
records or global recoding to the whole file (see [WdW01]). Measures of risk per
record can be exploited in the process of data protection to judge where to
use local suppression or to assess the effectiveness of the global recoding; for
this reason per-record risk estimation was one of the issues that the CASC
project targeted. On this subject ISTAT has built on previous work by [BF98]
to improve one such risk measure, which is used to identify unsafe records. The
software μ-Argus now contains a routine, implemented by CBS
Netherlands in cooperation with ISTAT, for computing the Benedetti-Franconi
(B-F) or individual risk of re-identification.

This paper reviews the main aspects of the B-F individual risk methodology.
We discuss measures of risk for files of independent records as well as for
hierarchical files. We limit our discussion to methodological aspects of the
procedure, while referring to [PS03] for the practice of the individual risk
methodology within μ-Argus, and to the Argus manual for an overview of the
software^1.

A relevant characteristic of social microdata is its inherent hierarchical struc- 
ture, which allows us to recognise groups of individuals in the file, the most typi- 
cal case being the household. When defining the re-identification risk, it is impor- 
tant to take into account this dependence among units: indeed re-identification 
of an individual in the group may affect the probability of disclosure of all its 
members. 

So far, implementation of a hierarchical risk has been performed only with 
reference to households. We will therefore refer to individual and household risk 
for files of independent units and hierarchical files, respectively. 

The individual risk methodology is currently the only approach that allows 
for a hierarchical structure in the data, which is in fact typical in many contexts. 
Allowing for dependence in estimating the risk enables us to attain a higher level 
of safety than when merely considering the case of independence. 

The paper is organised as follows: in Sect. 2 we discuss measures of risk of
re-identification. In particular, in Sect. 2.2 and 2.3 we define the individual risk
and discuss its estimation using sample data, whereas in Sect. 2.4 we introduce
the household risk, use of which is appropriate when a hierarchical structure
is present in the microdata. Section 3 is devoted to the combined use of these
measures of risk and local suppression to protect a file as it is currently
implemented in μ-Argus. In order to identify unsafe records, a risk threshold is to
be selected. Section 3.1 describes the tools that μ-Argus offers to perform this
task, whereas Sect. 3.2 presents methods for threshold setting using global risk
measures. Finally, in Sect. 4 we present our conclusions.

2 Extending the Measure of Risk from Individual
to Household Samples

In this section we discuss two related measures of disclosure risk, both
implemented in μ-Argus. The first (the individual risk) applies to files of independent
records, whereas the second (the household risk) suits hierarchically structured,
e.g. household, microdata. Under this circumstance it is possible to locate units
within the household and, to a certain extent, also establish relationships
between members of the same household. The household risk is defined in terms of
the individual risk; the rationale behind such a definition is that if a household
is re-identified, its members can be re-identified.

The definition of our risk measures is based on the concept of re-identification
disclosure (e.g. [CKM98,FM98,SH98,DL86,WdW01]), and is appropriate for
samples of microdata stemming from social surveys. By record re-identification we
mean that the unit in the released file and a unit in the register that an intruder
has access to correspond to the same individual in the population. The underlying
hypothesis is that the intruder will always try to match a record and a unit in the
register (i.e. if the record is not unique he will try a probabilistic link), and that
he will use public domain variables (key variables) only. As external registers
that contain information on households are not commonly available, for files of
households we make the assumption that the intruder attempts a confidentiality
breach by re-identification of individuals.

^1 Available at http://neon.vb.cbs.nl/CASC/MU.html

In social survey data, categorical key variables allow us to tackle the problem 
of disclosure limitation via the concept of unique or rare combinations in the 
sample. A key issue is to be able to distinguish between combinations that are 
at risk, for example sample uniques corresponding to rare combinations in the 
population, and combinations that are not at risk, for example sample uniques 
corresponding to combinations that are common in the population. To this aim, 
a step of inference from the sample to the population is performed. Instead of 
focusing on the sample frequency of combinations of key variables, the individual 
risk of disclosure can be defined as the probability that a sampled record is re- 
identified (i.e. recognised as corresponding to a particular unit in the population). 
This value can then be estimated for each record in the released file on the basis 
of the observed sample. This approach would allow for a more parsimonious use
of suppressions than the traditional local suppression methodology based on the
sample frequency of combinations. The latter has been available in μ-Argus since
the European Union project SDC (see [WdW01]). See [FP03] for a comparison.

In the last few years a number of proposals have been made: [FM98], [SH98] 
and [ES04] define, with different motivations, a log linear model for the estima- 
tion of the individual risk. [BF98] propose a methodology to estimate a measure 
of risk per record using the sampling weights, as the usual instrument that Na- 
tional Statistical Institutes adopt to allow for inference from the sample to the 
population. Further discussion of the approach is in [DCFS03], [Pol03], [Rin03]. 
A related approach is described in [Car02]. 



2.1 Some Notation 

Let the released file be a random sample s of size n drawn from a finite population
P consisting of N units. For a generic unit i in the population, we denote by $w_i$
its final sampling design weight. Under the hypothesis that the key variables
are discrete, cross-tabulating these produces a set of combinations {1, ..., K}.
A combination k is defined to be the k-th cell in the cross-tabulation. The set of
combinations defines a partition of both the population and the sample, and the
values of the key variables observed on unit i will classify such a record into one
combination. We denote by k = k(i) the cell into which the sampled record i
falls. Let $f_k$ and $F_k$ denote the size of the k-th cell in the sample and population,
respectively. Retaining only the observed combinations (combinations with zero
sample frequency being omitted) does not alter the above partition of the sample.

2.2 Definition of the Individual Risk

Assume for simplicity that there is complete agreement between the sample
and the external archive available to the intruder, as far as the key variables
are concerned (for a more general setting, see [Pol03]). We first note that if we
were to know the population frequency $F_k$ of the k-th combination, we would
define the probability of re-identification simply by $1/F_k$, for each record that
is classified in combination k (i.e. $\forall i : k(i) = k$). The population frequencies
are generally unknown, therefore an inferential step is to be performed. In the
proposal by [BF98] the uncertainty on $F_k$ is accounted for in a Bayesian fashion
by introducing the distribution of the population frequencies given the sample
frequencies. The individual risk of disclosure is then measured as the (posterior)
mean of $1/F_k$ with respect to the distribution of $F_k \mid f_k$:

$$r_k = E\left(\frac{1}{F_k} \,\Big|\, f_k\right) = \sum_{h \ge f_k} \frac{1}{h} \Pr(F_k = h \mid f_k) . \qquad (1)$$

To determine the probability mass function of $F_k \mid f_k$, the following
superpopulation approach is introduced [BKP90,Rin03,Pol03]:

$$\begin{aligned}
\pi_k &\sim [\pi_k] \propto 1/\pi_k, & \pi_k &> 0, \; k = 1, \ldots, K, \text{ independently} \\
F_k \mid \pi_k &\sim \mathrm{Poisson}(N\pi_k), & F_k &= 0, 1, \ldots, \text{ independently} \qquad (2) \\
f_k \mid F_k, \pi_k, p_k &\sim \mathrm{binomial}(F_k, p_k), & f_k &= 0, 1, \ldots, F_k, \text{ independently} .
\end{aligned}$$

Under these assumptions, the posterior distribution of $F_k \mid f_k$ is negative
binomial with success probability $p_k$ and number of successes $f_k$. In general, the
probability mass function of a negative binomial variable counting the number
of trials needed to reach the j-th success, each with probability $p_k$, is the following:

$$\Pr(F_k = h \mid f_k = j) = \binom{h-1}{j-1} p_k^{\,j} (1 - p_k)^{h-j} , \qquad h \ge j .$$

In [BF98] it is shown that under the negative binomial distribution the risk (1)
can be expressed as

$$r_k = E(F_k^{-1} \mid f_k) = \int_0^\infty \left\{ \frac{p_k \exp(-t)}{1 - q_k \exp(-t)} \right\}^{f_k} dt ,$$

where $q_k = 1 - p_k$. The transformation $\exp(-t) = y$ gives the integral (with the
integration variable again written as t)

$$r_k = p_k^{f_k} \int_0^1 t^{f_k - 1} (1 - t q_k)^{-f_k} \, dt$$

that can be expressed [Pol03] via the Hypergeometric function ${}_2F_1(a, b; c; z)$ as

$$r_k = \frac{p_k^{f_k}}{f_k} \, {}_2F_1(f_k, f_k; f_k + 1; q_k) , \qquad (3)$$

where

$${}_2F_1(a, b; c; z) = \frac{\Gamma(c)}{\Gamma(b)\Gamma(c-b)} \int_0^1 t^{b-1} (1-t)^{c-b-1} (1 - tz)^{-a} \, dt$$

is the integral representation (valid for $\Re(c) > \Re(b) > 0$) of the Gauss
Hypergeometric series [AS65].

Substitution of an estimate of $p_k$ in (3) leads to an estimate of the individual
risk of disclosure (1). Given $F_k$, the maximum likelihood estimator of $p_k$ under
the binomial model in (2) is $\hat p_k = f_k/F_k$. $F_k$ being not observable, [BF98] propose
to use

$$\hat p_k = \frac{f_k}{\sum_{i : k(i) = k} w_i} ,$$

where $\sum_{i : k(i) = k} w_i$ is an estimate of $F_k$ based on the sampling design, possibly
calibrated [DS92].

The negative binomial distribution is defined for $0 < p_k < 1$; in practice,
estimates $\hat p_k$ might attain the extremes of the unit interval. We never deal
with $\hat p_k = 0$, as it corresponds to $f_k = 0$. On the other hand, when $\hat p_k = 1$,
${}_2F_1(f_k, f_k; f_k + 1; 0) = 1$, therefore the individual risk equals $1/f_k$.
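
The estimation procedure of this section can be illustrated with a short sketch. Rather than evaluating the Hypergeometric function, the code below sums the series $\sum_{h \ge f_k} h^{-1} \Pr(F_k = h \mid f_k)$ directly under the negative binomial posterior, and estimates $p_k$ as the cell's sample frequency divided by the sum of its sampling weights. The function names, the truncation tolerance, the capping of the estimate at 1 and the toy data are assumptions made for illustration; this is not the routine implemented in μ-Argus.

```python
import math
from collections import defaultdict


def individual_risk(f_k, p_k, tol=1e-12, max_terms=10_000_000):
    """Posterior mean of 1/F_k when F_k | f_k is negative binomial.

    Terms are generated by the recurrence
    Pr(F = h + 1) = Pr(F = h) * q * h / (h - f + 1), starting at h = f.
    """
    if f_k < 1 or not 0.0 < p_k <= 1.0:
        raise ValueError("need f_k >= 1 and 0 < p_k <= 1")
    q_k = 1.0 - p_k
    h = f_k
    pmf = p_k ** f_k          # Pr(F_k = f_k | f_k)
    mass = pmf                # accumulated probability
    risk = pmf / h
    for _ in range(max_terms):
        if 1.0 - mass <= tol:  # remaining tail is negligible
            break
        pmf *= q_k * h / (h - f_k + 1)
        h += 1
        mass += pmf
        risk += pmf / h
    return risk


def estimate_risks(cells, weights):
    """cells[i] is the key-variable combination of record i, weights[i] its
    sampling weight. Returns the estimated risk for each observed cell."""
    f = defaultdict(int)
    F_hat = defaultdict(float)
    for k, w in zip(cells, weights):
        f[k] += 1
        F_hat[k] += w
    # Cap the estimate of p_k at 1 (assumption for this sketch).
    return {k: individual_risk(f[k], min(1.0, f[k] / F_hat[k])) for k in f}


risks = estimate_risks(["a", "a", "b"], [10.0, 10.0, 4.0])
```

For a sample unique ($f_k = 1$) the series has the closed form $\{p_k/(1 - p_k)\}\log(1/p_k)$, which follows from the integral above and gives a convenient check on the numerical sum.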

2.3 Some Comments on Risk Estimation 

In [DCFS03] an experiment was conducted to assess the performance of the
individual risk of disclosure. The aim was to investigate whether the individual
risk is estimating the correct quantity, i.e. the real risk of an individual, and also
whether the quality of this estimation is appropriate. A good agreement was
noticed between the real risk and its estimates, although the precision of the
estimator seems poor for rare combinations in the sample as compared to the
more common ones. Lower precision is an intrinsic problem of small counts.
Discriminating rare and common features in the population is, by far, the most
difficult task, especially when one can count on only one occurrence in the sample.
For this reason further studies to improve the performance of the estimator used
for the individual risk have been planned.

It has been questioned whether model (2) provides a good fit to real data under
some circumstances [Rin03]. Our experiments seem to indicate that the model holds,
at least with our data. Notice that the assumed model is compatible with a large
number of relatively small cells in the population. These might occur with either
a small population or a large number of combinations of key variables.

2.4 Household Risk 

From the definition given in Sect. 2.2, it follows that the individual risk can
be considered an estimate of a probability of re-identification. We define the
household risk as the probability that at least one individual in the household
is re-identified. Consequently, the household risk can be derived from the
individual risks and knowledge of the household structure of the data.




Individual Risk Estimation in μ-Argus: A Review



For a given household g of size |g|, whose members we label $i_1, \ldots, i_{|g|}$, the
household risk is defined as

$$r^h_g = \Pr(i_1 \cup i_2 \cup \ldots \cup i_{|g|} \text{ re-identified}) . \qquad (4)$$

Assuming independence of re-identification attempts within the same household,
(4) can be expressed by Boole's formula using the individual risks
$r_{i_1}, \ldots, r_{i_{|g|}}$ defined in (1):

$$r^h_g = \sum_{j=1}^{|g|} r_{i_j} - \sum_{j<l} r_{i_j} r_{i_l} + \sum_{j<l<m} r_{i_j} r_{i_l} r_{i_m}
        - \ldots + (-1)^{|g|+1} r_{i_1} r_{i_2} \cdots r_{i_{|g|}} . \qquad (5)$$

By symmetry, the ordering of units in the group is not relevant. In a hierarchical
file therefore the measure of disclosure risk is the same for all the individuals in
household g and equals $r^h_g$.
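
As a small numerical illustration of the household risk, the sketch below computes it in two ways: through the inclusion-exclusion (Boole) expansion, and through the equivalent product form 1 - prod(1 - r_i) that holds under the independence assumption. The function names and the example risks are hypothetical.

```python
import math
from itertools import combinations


def household_risk(individual_risks):
    """Probability that at least one member is re-identified,
    assuming independent re-identification attempts."""
    return 1.0 - math.prod(1.0 - r for r in individual_risks)


def household_risk_boole(individual_risks):
    """Same quantity via the alternating inclusion-exclusion sum."""
    total = 0.0
    for size in range(1, len(individual_risks) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        total += sign * sum(math.prod(subset)
                            for subset in combinations(individual_risks, size))
    return total


member_risks = [0.1, 0.2, 0.05]
risk_product_form = household_risk(member_risks)
risk_boole_form = household_risk_boole(member_risks)
```

The product form is cheaper for large households, since the expansion has $2^{|g|} - 1$ terms; both forms are bounded above by the sum of the individual risks.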

3 Protection 

The risk model implemented in μ-Argus makes it possible to classify records as safe
or unsafe according to their individual risk and, if relevant, the household
structure of the file. μ-Argus then applies local suppression (see [dWW98]) to the
unsafe records only, or global recoding to the whole file. To apply local suppression
users must select a threshold, i.e. a level of acceptable risk, representing a risk
value below which an individual can be considered safe. Global recoding can
be assessed by the same threshold; such a procedure can be repeated until all
records in the file are safe.

3.1 Direct Specification of Risk Thresholds 

Once the set of key variables and the individual risk methodology have been
selected, μ-Argus displays a histogram of the individual risk
(see the top panel of Fig. 1). For data organised according to a household
structure, a similar graph will be produced for the household risk (bottom panel of
Fig. 1). For files of independent records, a threshold on the individual risk can
be selected either manually, entering a value in the Risk threshold box, or
through the slider at the bottom of the risk histogram. In both panels of Fig. 1,
the Calc button allows the user to compute the number of unsafe records
(households) corresponding to the given threshold. Conversely, the user can specify
the number of unsafe records (households) and determine the corresponding risk
threshold via the corresponding Calc button.

Apart from reporting the number of unsafe records (households), μ-Argus so far
provides no guidance on how to select a risk threshold; a criterion, first proposed
in [PS03] and [Pol03] for non-hierarchical files, is discussed in Sect. 3.2.
Throughout this section we assume for simplicity that the user has in mind a target
threshold level, possibly chosen by the means to be described in Sect. 3.2.

Fig. 1. Risk histograms: individual (top panel) and household risk (bottom panel). A
logarithmic scale is used



For files of independent records, setting a threshold on the individual risk is
a straightforward operation that can be accomplished as discussed previously.
Next we discuss how to select thresholds when dealing with household data. We
therefore assume that the user has in mind a threshold $\alpha$ on the household risk
$r^h_g$. We first notice that each record belonging to household g has the same
disclosure risk $r^h_g$. By definition, the hierarchical risk of household g, $r^h_g$, is
bounded by the sum of the individual risks of records belonging to household g
(see (5)):

$$r^h_g \le \sum_{j=1}^{|g|} r_{i_j} .$$

For a threshold $\alpha$, household g is safe if $r^h_g < \alpha$; in order for household g to be
classified safe, it is therefore sufficient that each of its components has individual
risk less than $\delta_g = \alpha/|g|$. Note that $\delta_g$ varies across individuals according to the
household they belong to. In practice, denoting by g(i) the household to which
record i belongs, a threshold $\alpha$ on the household risk is turned into a vector of
thresholds on the individual risks $r_i$, i = 1, ..., n:

$$\delta_g = \delta_{g(i)} = \alpha/|g(i)| . \qquad (6)$$

Clearly this is a strongly conservative approach. Records are therefore set to
unsafe whenever $r_i \ge \delta_{g(i)}$; local suppression is then applied to those records, if
requested. Suppression ensures that after protection the household risk is below
the threshold $\alpha$.

This approach to threshold setting for identifying unsafe records in
hierarchical files has been implemented in the latest beta-release of μ-Argus.
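
The conversion of a household threshold into record-level thresholds can be sketched as follows. The code assumes that a record is flagged unsafe when its individual risk is at least the threshold of its household (the household threshold divided by the household size); the function names and the toy data are hypothetical and do not reproduce the μ-Argus implementation.

```python
from collections import Counter


def individual_thresholds(household_of, alpha):
    """household_of[i] is the household label of record i; returns the
    per-record threshold alpha / |g(i)|."""
    sizes = Counter(household_of)
    return [alpha / sizes[g] for g in household_of]


def unsafe_records(risks, household_of, alpha):
    """Indices of records whose individual risk reaches their threshold."""
    thresholds = individual_thresholds(household_of, alpha)
    return [i for i, (r, d) in enumerate(zip(risks, thresholds)) if r >= d]


risks = [0.01, 0.04, 0.02]
households = ["h1", "h1", "h2"]
flagged = unsafe_records(risks, households, alpha=0.05)
```

Because the household risk is bounded by the sum of its members' risks, keeping every member below the household threshold divided by the household size is sufficient, though not necessary, for the household to be safe; this is why the rule is conservative.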



3.2 Methods for Threshold Setting via Global Risk Measures 

As discussed in the previous section, threshold setting results in a (possibly
household-specific) threshold level of the individual risk. μ-Argus provides no
guidance to the user on how to perform this task; [PS03] and [Pol03] propose to use
the so-called re-identification rate.

The individual risk provides a measure of risk per record. A global measure
of disclosure risk for the whole sample can be expressed by the re-identification
rate, which is defined by means of the expected number of re-identifications in
the file. Global measures of this kind are not new in the literature of disclosure
limitation; see for example [Lam93] and [DFT02]. Here this measure is discussed
as a tool for setting a risk threshold and identifying unsafe records.

We refer to the interpretation of $r_k$ as an estimate of a probability of
re-identification to introduce the expected number of re-identifications in the file,
$\sum_{k=1}^{K} f_k r_k$; the re-identification rate is then

$$\xi = \frac{1}{n} \sum_{k=1}^{K} r_k f_k .$$

One use of global risk measures is the following: if the re-identification rate
$\xi$ of the file is below a level $\eta$ that the user considers acceptable, i.e. $\xi < \eta$,
then the file can be released. As $\xi$ depends on the $r_k$s, a threshold $\alpha$ can be
determined for the individual risk, such that $\xi$ does not exceed $\eta$. From a user's
perspective, working on rates or percentages of re-identifications in the file is
easier and intuitively more appealing than working on the risk scale. Moreover use
of a global measure such as $\xi$ also implies an evaluation of the overall safety of
the released file.

A similar reasoning applies when setting the threshold for the household
risk. Denote by G the total number of households in the file, and define the
expected number of re-identified households in the file as $\sum_{g=1}^{G} r^h_g$. The
household re-identification rate

$$\xi^h = \frac{1}{G} \sum_{g=1}^{G} r^h_g$$

can be exploited to select a household risk threshold, along the same lines as
above. Once a threshold $\alpha$ on the household risk has been selected, the procedure
described in Sect. 3.1 can be applied to transform that value into an individual
risk threshold (see (6)).
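
The two global measures can be computed directly from the cell-level or household-level risks. The following minimal sketch uses hypothetical function names and toy numbers; it only evaluates the rates and compares them with a user-chosen acceptable level, and does not attempt to reproduce the threshold-selection procedure of [PS03].

```python
def reidentification_rate(cell_sizes, cell_risks):
    """Expected number of re-identifications, sum of f_k * r_k,
    divided by the sample size n = sum of f_k."""
    n = sum(cell_sizes)
    expected = sum(f * r for f, r in zip(cell_sizes, cell_risks))
    return expected / n


def household_reidentification_rate(household_risks):
    """Expected number of re-identified households divided by G."""
    return sum(household_risks) / len(household_risks)


cell_sizes = [1, 2, 5]            # sample frequencies f_k
cell_risks = [0.5, 0.1, 0.02]     # individual risks r_k
rate = reidentification_rate(cell_sizes, cell_risks)
acceptable_level = 0.15           # hypothetical level chosen by the user
can_release = rate < acceptable_level
```

Expressing the decision as a rate lets the user reason in terms of the expected share of re-identified records rather than on the risk scale itself.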

Threshold setting via the re-identification rate is currently being
implemented in μ-Argus.



4 Conclusions 

Individual risk estimation was one of the issues that the European Union project
IST-2000-25069 CASC on Computational Aspects of Statistical Confidentiality
targeted. On this subject ISTAT has built on previous work by [BF98] to derive
individual risk measures that can be exploited in the process of data protection
to identify unsafe records. One of the project's main results is the software
Argus, a practical tool aimed at supporting users in charge of data release during
the phase of data protection. In particular the software μ-Argus, specific to
microdata, now contains a routine, implemented by CBS Netherlands
in cooperation with ISTAT, for computing the Benedetti-Franconi (B-F) or
individual risk of disclosure. In this paper we have reviewed the main results that
the project has achieved as far as the individual risk methodology is concerned.

Implementation of the individual risk methodology in the free software
μ-Argus has made its application easier and allowed testing of the procedure and
comparisons with other methods. During Argus' testing phase researchers not
directly involved in the CASC project have been exposed to the theory and
application of these techniques, some of which were new to the literature. In
particular, a debate on risk models has emerged, concerning the underlying
assumptions of the individual risk model and the assessment of the theory itself;
alternative measures of risk and approaches to estimation have been proposed or
called for [Rin03]. Such a debate had a beneficial effect in suggesting refinements
of the theory and directions for future work; the theory on the individual risk has
been further examined [DCFS03,Pol03], and extensions and modifications are
currently under study by the authors; other approaches to risk estimation are being
pursued e.g. by [Rin03] (where measures of performance are also mentioned),
[Car02], [ES04].

Further research to improve the performance of the estimator used for the
individual risk has been planned; in particular, it is foreseen that the use of
methodologies borrowed from the area of small domain estimation will allow for
progress in this field.

Analysis of Re-identification Risk
Based on Log-Linear Models

Abstract. The number of unique records in a released microdata set
which are unique in the population is an important measure of
re-identification disclosure risk in microdata. However, the microdata sample
contains more information about the disclosure risk than the number
of unique records alone. This paper develops a technique
based on log-linear models to extract more information from the sample
about the disclosure risk, not only through the number of sample unique
records but also through the number of twin and triple records. This
information may help the microdata release committee in deciding
whether to release the data for public use. For illustration we apply the
proposed method to data from a General Household Survey 1996-1997.



1 Introduction 

A common concern of statistical offices that release microdata for use by external 
researchers is to diminish the risk of disclosure of information on individuals. It 
is generally accepted that it is insufficient to discard direct identifying variables 
such as names, addresses etc., because individuals may also be recognized on the 
basis of their values on other indirect identifying variables such as a geographi- 
cal indicator, profession and age. If certain combinations of values of identifying 
variables occur only once in the population, the associated individuals are unique 
with respect to these variables. If a researcher knows the values of the identify- 
ing variables for certain unique individuals this researcher can establish a link 
between the record and the individual. Such a link leads to disclosure of the 
remaining information in the record, which was not known beforehand; see, for 
example, [4], [8], [11] and [13]. 

Measures of disclosure risk are often based upon the notion of identifying or
key variables; see [2]. These are variables with values assumed known both for
individuals in the microdata sample and for certain identifiable individuals in
the population. An example of a measure of disclosure risk is the proportion of
individuals in the microdata sample which have a unique combination of values
of the key variables (assumed categorical) in the population; see [10]. Such
individuals, referred to as population uniques, may be judged to be particularly
at risk of disclosure.

As noted by many authors (see, for example, [7], [9], [12] and [10]), disclosure
may also occur when the population contains cells with small counts, for example
the number of twin and triple records. A release of a count of 2 may allow
someone with a set of almost unique characteristics to identify the only other
person in the population with these characteristics, and a release of a count of 3,
in the absence of other knowledge, allows someone else to be linked to an intruder's
database containing these same variables with probability of either 1/2 or 1/3,
depending on whether the intruder possesses such characteristics. Thus it is of
interest to consider the number of twin and triple records as well as the number
of unique records, since these records may still be a cause for concern.

In this paper we study the risk of disclosure not only through the number
of unique records but also through the number of twin and triple records in the
microdata sample which are also twins and triples in the population. This extra
information may reflect more risk than is reflected by the number of unique
records alone, and it helps the microdata release committee in
deciding whether to release the data for public use.

The notation and assumptions are given in Section 2. In Section 3 we present
the proposed method. In Section 4 we apply the proposed method to microdata
from the British General Household Survey for 1996 and 1997.



2 Notation and Assumptions 



Consider a finite population $U$ of size $N$ from which a simple random sample
$s \subset U$ of size $n < N$ is drawn. The sampling fraction is denoted by $\pi = n/N$.
Following [2], we assume that the possibility of statistical disclosure arises if an
intruder gains access to the microdata and attempts to match a microdata record
to external information on a known individual using the values of $m$ discrete key
variables $X_1, X_2, \ldots, X_m$. In order to define some measures of disclosure risk
we introduce some further notation. Let the variable formed by cross-classifying
$X_1, X_2, \ldots, X_m$ be denoted $X$, with values denoted $1, \ldots, J$, where $J$ is the
number of categories or key values of $X$. Each of these key values corresponds to
a possible combination of categories of the key variables. Let $F_j$ be the number
of units in the population with key value $j$, i.e. the population frequency
or size of cell $j$ for $j = 1, \ldots, J$, and let the population frequencies of
frequencies be $N_r = \sum_{j=1}^{J} I(F_j = r)$, $r = 1, 2, \ldots$, where $I(\cdot)$ is the
indicator function. For example, $N_1$ is the number
of population uniques and $N_2$ is the number of population twins. The sample
counterpart of $F_j$ is denoted by $f_j$ and the sample frequencies of frequencies
by $n_r = \sum_{j=1}^{J} I(f_j = r)$, $r = 1, 2, \ldots$. For example, $n_1$ is the number of sample
uniques and $n_2$ is the number of sample twins.
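
The notation above can be illustrated with a minimal Python sketch. It assumes that each microdata record is a tuple of key-variable categories; the record values and the function names `cell_frequencies` and `frequencies_of_frequencies` are hypothetical and chosen only for illustration. It computes the cell counts $f_j$ and the frequencies of frequencies $n_r$ for the non-empty cells.

```python
from collections import Counter


def cell_frequencies(records):
    """Return f_j: the number of records in each observed key value (cell)."""
    return Counter(tuple(record) for record in records)


def frequencies_of_frequencies(cell_counts):
    """Return n_r: the number of cells whose count equals r."""
    return Counter(cell_counts.values())


# Hypothetical sample with three key variables (sex, marital status, age band).
records = [
    ("female", "married", "30-39"),
    ("female", "married", "30-39"),
    ("male", "single", "20-29"),
    ("male", "widowed", "70-79"),
    ("female", "single", "20-29"),
    ("female", "single", "20-29"),
    ("female", "single", "20-29"),
]

f = cell_frequencies(records)
n = frequencies_of_frequencies(f)
# n[1] counts sample uniques, n[2] sample twins, n[3] sample triples.
```

The same two functions applied to a population file would give $F_j$ and $N_r$.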

Following [12] we assume the following model, which arises if the $F_j$ are
independently Poisson distributed with mean $\lambda_j$,

$$P(F_j = x) = \frac{\lambda_j^{x} \exp(-\lambda_j)}{x!}, \qquad x = 0, 1, 2, \ldots,$$

and the sample is selected by Poisson sampling with selection probability $\pi$. We
write $F_j = f_j + (F_j - f_j)$, where $f_j$ and $F_j - f_j$ are independent with Poisson
distributions (Po)

$$f_j \mid \lambda_j \sim \mathrm{Po}(\pi \lambda_j) \quad \text{and} \quad F_j - f_j \mid \lambda_j \sim \mathrm{Po}[(1 - \pi) \lambda_j],$$

where the $\lambda_j$ are treated here as fixed parameters and there is no misclassification.
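
The decomposition above can be checked numerically. The following is a minimal simulation sketch, not part of the original paper: it assumes NumPy is available, picks arbitrary illustrative values $\lambda_j = 3$ and $\pi = 0.1$, and implements Poisson sampling by selecting each of the $F_j$ units independently with probability $\pi$ (a binomial draw). The function name `simulate_cell` is hypothetical.

```python
import numpy as np


def simulate_cell(lam, pi, n_draws, seed=0):
    """Draw population counts F ~ Po(lam) and sample counts f | F ~ Bin(F, pi)."""
    rng = np.random.default_rng(seed)
    F = rng.poisson(lam, size=n_draws)   # population frequency of one cell
    f = rng.binomial(F, pi)              # units of that cell selected into the sample
    return F, f


F, f = simulate_cell(lam=3.0, pi=0.1, n_draws=200_000)
rest = F - f                             # units of the cell not in the sample

mean_f = f.mean()                        # should be close to pi * lam = 0.3
mean_rest = rest.mean()                  # should be close to (1 - pi) * lam = 2.7
corr = np.corrcoef(f, rest)[0, 1]        # should be close to 0 (independence)
```

A near-zero correlation is only a necessary condition for independence, so this is a sanity check rather than a proof.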



3 The Proposed Method 

A commonly used global measure of the risk of a sample microdata file is the
probability that a unit that is unique in the sample (SU) is also unique in the
population (PU), which is defined as

$$\Pr(\mathrm{PU} \mid \mathrm{SU}) = \frac{\Pr(\mathrm{PU}, \mathrm{SU})}{\Pr(\mathrm{SU})} = \frac{\sum_j I(F_j = 1, f_j = 1)}{\sum_j I(f_j = 1)}. \tag{1}$$

This measure depends on the number of records that are unique in both the sample
and the population, $\sum_j I(F_j = 1, f_j = 1)$. Although this number of unique records is
important and reflects the most risky records, we extend it to the number of twin and triple
records in the sample and population, where we study:

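- The expected number of population uniques and sample uniques

$$C_{11} = E\Big[\sum_j I(F_j = 1, f_j = 1)\Big]$$

- The expected number of population twins and sample uniques and sample twins

$$C_{21} = E\Big[\sum_j I(F_j = 2, f_j = 1)\Big] \quad \text{and} \quad C_{22} = E\Big[\sum_j I(F_j = 2, f_j = 2)\Big]$$

- The expected number of population triples and sample uniques, twins and triples

$$C_{31} = E\Big[\sum_j I(F_j = 3, f_j = 1)\Big], \quad C_{32} = E\Big[\sum_j I(F_j = 3, f_j = 2)\Big] \quad \text{and} \quad C_{33} = E\Big[\sum_j I(F_j = 3, f_j = 3)\Big]$$

We could base our decision on the following rule of thumb

$$C_{11} > \alpha_1, \quad C_{21}, C_{22} > \alpha_2 \quad \text{and} \quad C_{31}, C_{32}, C_{33} > \alpha_3,$$

where $\alpha_1$, $\alpha_2$ and $\alpha_3$ are values large enough to ensure that the risk is negligible.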
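Under the assumptions of Section 2 we have the following theorem.

Theorem 1. Under Poisson sampling with fixed $\lambda_j$ and $r \le s$ we have

$$E\Big[\sum_j I(F_j = s, f_j = r)\Big] = \sum_j \frac{\mu_j^{r} \, [(1 - \pi) \mu_j / \pi]^{s-r} \exp(-\mu_j / \pi)}{r! \, (s - r)!} \tag{2}$$

where $\mu_j = E(f_j) = \pi \lambda_j$.

Proof. To prove this theorem we first note that

$$E\Big[\sum_j I(F_j = s, f_j = r)\Big] = E\Big[\sum_j I(F_j - f_j = s - r) \, I(f_j = r)\Big].$$

Since $F_j - f_j$ and $f_j$ are independent we find that

$$E\Big[\sum_j I(F_j = s, f_j = r)\Big] = \sum_j E[I(F_j - f_j = s - r)] \, E[I(f_j = r)].$$

Using the Poisson assumption given $\lambda_j$ we find

$$E\Big[\sum_j I(F_j = s, f_j = r)\Big] = \sum_j \frac{(\pi \lambda_j)^{r} \, [(1 - \pi) \lambda_j]^{s-r} \exp(-\lambda_j)}{r! \, (s - r)!}.$$

Substituting $\mu_j = E(f_j) = \pi \lambda_j$ we obtain

$$E\Big[\sum_j I(F_j = s, f_j = r)\Big] = \sum_j \frac{\mu_j^{r} \, [(1 - \pi) \mu_j / \pi]^{s-r} \exp(-\mu_j / \pi)}{r! \, (s - r)!},$$

where $s = 1, 2, 3, \ldots$ and $r = 0, 1, 2, 3, \ldots$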
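
As a numerical check of formula (2), the sketch below evaluates the closed form and compares it with the direct product of the two Poisson probabilities used in the proof. It is an illustration only: the function names and the values of $\lambda_j$ and $\pi$ are assumptions, not taken from the paper.

```python
from math import exp, factorial, isclose


def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Po(lam)."""
    return lam ** k * exp(-lam) / factorial(k)


def expected_count(mu, pi, s, r):
    """Closed form (2): expected number of cells with F_j = s and f_j = r."""
    if not 0 <= r <= s:
        raise ValueError("formula (2) requires 0 <= r <= s")
    return sum(
        m ** r * ((1 - pi) * m / pi) ** (s - r) * exp(-m / pi)
        / (factorial(r) * factorial(s - r))
        for m in mu
    )


def expected_count_direct(lam, pi, s, r):
    """Same quantity from P(f_j = r) * P(F_j - f_j = s - r), as in the proof."""
    return sum(
        poisson_pmf(r, pi * l) * poisson_pmf(s - r, (1 - pi) * l) for l in lam
    )


# Hypothetical cell means lambda_j and sampling fraction pi.
lam = [0.5, 1.0, 3.0, 10.0]
pi = 0.05
mu = [pi * l for l in lam]

pairs = [(1, 1), (2, 1), (2, 2), (3, 1), (3, 2), (3, 3)]
closed = {sr: expected_count(mu, pi, *sr) for sr in pairs}
direct = {sr: expected_count_direct(lam, pi, *sr) for sr in pairs}
agree = all(isclose(closed[sr], direct[sr], rel_tol=1e-9) for sr in pairs)
```

The six pairs correspond to $C_{11}$, $C_{21}$, $C_{22}$, $C_{31}$, $C_{32}$ and $C_{33}$.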



It is interesting to obtain from (2) the relation between the expected number of
population uniques that are sample zeros, $C_{10} = E[\sum_j I(F_j = 1, f_j = 0)]$, and the
expected number of population uniques that are sample uniques,
$C_{11} = E[\sum_j I(F_j = 1, f_j = 1)]$:

$$C_{10} = \frac{1 - \pi}{\pi} \, C_{11},$$

where $C_{10} = [(1 - \pi)/\pi] \sum_j \mu_j \exp(-\mu_j/\pi)$ and $C_{11} = \sum_j \mu_j \exp(-\mu_j/\pi)$.

Table 1 gives the relation between $\pi$ and $C_{10}/C_{11}$ under the Poisson assumption.
The table shows that the number of population uniques that are not in the
sample, relative to the number that are both population and sample uniques, is high
for small sampling fractions and decreases as the sampling fraction increases.
We discuss this point further in the conclusion.

Table 1. The relation between $\pi$ and $C_{10}/C_{11}$ under the Poisson assumption.

$\pi$              0.001  0.002  0.003  0.005  0.01  0.03  0.05  0.07  0.10
$C_{10}/C_{11}$    999    499    332.3  199    99    32.3  19    13.3  9
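
Since the ratio $C_{10}/C_{11}$ reduces to $(1 - \pi)/\pi$, the entries of Table 1 can be reproduced with a few lines of Python. This is a small illustrative sketch; the function name `unsampled_to_sampled_ratio` is hypothetical.

```python
def unsampled_to_sampled_ratio(pi):
    """Return C10 / C11 = (1 - pi) / pi for a sampling fraction 0 < pi < 1."""
    if not 0 < pi < 1:
        raise ValueError("sampling fraction must lie strictly between 0 and 1")
    return (1 - pi) / pi


fractions = [0.001, 0.002, 0.003, 0.005, 0.01, 0.03, 0.05, 0.07, 0.10]
table_1 = {pi: round(unsampled_to_sampled_ratio(pi), 1) for pi in fractions}
```

For example, at $\pi = 0.01$ each population unique that appears in the sample is matched, in expectation, by 99 population uniques that do not.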



3.1 Estimation from Microdata Sample 

In practice the $F_j$ will generally be unknown. We now consider estimating
$E[\sum_j I(F_j = s, f_j = r)]$ from the microdata sample $\{f_j\}$ using a log-linear model
under the Poisson assumption. Let us condition on the values of the key variables
defining $X$, a set of indirect characteristics that may be known about respondents
and that may be used to identify them. Recall that $\mu_j = E(f_j)$ and $f_j \sim \mathrm{Po}(\mu_j)$,
where $\mu_j = \pi \lambda_j$. A log-linear model for the $\mu_j$ may be expressed as

$$\log \mu_j = x_j' \beta,$$

where $x_j$ is a vector containing the specified main effects and interactions for
$X_1, \ldots, X_m$. Such a model may be fitted using standard procedures, [1] and [5], to
give an estimated vector $\hat{\beta}$ and fitted values

$$\hat{\mu}_j = \exp(x_j' \hat{\beta}). \tag{3}$$

From (2) and (3) the estimate of $E[\sum_j I(F_j = s, f_j = r)]$ is

$$\hat{E}\Big[\sum_j I(F_j = s, f_j = r)\Big] = \sum_j \frac{\hat{\mu}_j^{r} \, [(1 - \pi) \hat{\mu}_j / \pi]^{s-r} \exp(-\hat{\mu}_j / \pi)}{r! \, (s - r)!}, \tag{4}$$

which can be computed from the microdata sample.

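Thus the estimates of $C_{11}$, $C_{21}$, $C_{22}$, $C_{31}$, $C_{32}$ and $C_{33}$ from (4) are
obtained as

$$\hat{C}_{11} = \sum_j \hat{\mu}_j \exp(-\hat{\mu}_j/\pi), \qquad \hat{C}_{21} = \sum_j \frac{(1 - \pi) \, \hat{\mu}_j^{2} \exp(-\hat{\mu}_j/\pi)}{\pi},$$

$$\hat{C}_{22} = \sum_j \frac{\hat{\mu}_j^{2} \exp(-\hat{\mu}_j/\pi)}{2}, \qquad \hat{C}_{31} = \sum_j \frac{(1 - \pi)^{2} \, \hat{\mu}_j^{3} \exp(-\hat{\mu}_j/\pi)}{2 \pi^{2}},$$

and

$$\hat{C}_{32} = \sum_j \frac{(1 - \pi) \, \hat{\mu}_j^{3} \exp(-\hat{\mu}_j/\pi)}{2 \pi}, \qquad \hat{C}_{33} = \sum_j \frac{\hat{\mu}_j^{3} \exp(-\hat{\mu}_j/\pi)}{6}.$$

Note also that we can estimate the measure of risk defined in (1) as

$$\widehat{\Pr}(\mathrm{PU} \mid \mathrm{SU}) = \frac{\hat{E}\big[\sum_j I(F_j = 1, f_j = 1)\big]}{n_1} = \frac{\hat{C}_{11}}{n_1},$$

where $n_1$ is the number of uniques in the sample.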




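To obtain $\hat{\mu}_j$ we fit log-linear models by iterative proportional fitting, using
either main effects only or main effects and all two-factor interactions. For example,
if we have four variables we fit these two models

$$\log \mu_{ijkl} = \lambda + \lambda_i^{X_1} + \lambda_j^{X_2} + \lambda_k^{X_3} + \lambda_l^{X_4}$$

and

$$\log \mu_{ijkl} = \lambda + \lambda_i^{X_1} + \lambda_j^{X_2} + \lambda_k^{X_3} + \lambda_l^{X_4} + \lambda_{ij}^{X_1 X_2} + \lambda_{ik}^{X_1 X_3} + \lambda_{il}^{X_1 X_4} + \lambda_{jk}^{X_2 X_3} + \lambda_{jl}^{X_2 X_4} + \lambda_{kl}^{X_3 X_4},$$

where $i = 1, \ldots, r_1$, $j = 1, \ldots, r_2$, $k = 1, \ldots, r_3$ and $l = 1, \ldots, r_4$, and $r_1$, $r_2$, $r_3$
and $r_4$ are the numbers of categories for each variable.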
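
The fitting step can be sketched in Python as follows. This is a generic textbook version of iterative proportional fitting written with NumPy, not the author's implementation: the function names `ipf` and `estimate_pu_given_su`, the toy $2 \times 3 \times 2$ table and the value of $\pi$ are all assumptions made for illustration. Each entry of `margins` lists the axes of one marginal table that the fitted values must reproduce, so single axes give the main-effects model and all pairs of axes give the model with two-factor interactions.

```python
import itertools

import numpy as np


def ipf(table, margins, n_iter=200, tol=1e-10):
    """Fitted values of a hierarchical log-linear model via IPF."""
    table = np.asarray(table, dtype=float)
    fitted = np.full(table.shape, table.sum() / table.size)  # uniform start
    axes = tuple(range(table.ndim))
    for _ in range(n_iter):
        previous = fitted.copy()
        for keep in margins:
            drop = tuple(a for a in axes if a not in keep)
            observed = table.sum(axis=drop, keepdims=True)
            current = fitted.sum(axis=drop, keepdims=True)
            ratio = np.divide(observed, current,
                              out=np.zeros_like(observed), where=current > 0)
            fitted = fitted * ratio  # rescale so this margin matches the data
        if np.abs(fitted - previous).max() < tol:
            break
    return fitted


def estimate_pu_given_su(sample_table, fitted, pi):
    """Plug-in estimate of Pr(PU|SU): sum_j mu_j exp(-mu_j / pi) divided by n_1."""
    n1 = int((np.asarray(sample_table) == 1).sum())
    if n1 == 0:
        raise ValueError("the sample contains no unique cells")
    c11_hat = float((fitted * np.exp(-fitted / pi)).sum())
    return c11_hat / n1


# Hypothetical sample counts f_j cross-classified by three key variables.
sample = np.array([[[4, 1], [2, 3], [1, 3]],
                   [[2, 2], [5, 1], [1, 6]]])

main_effects = [(0,), (1,), (2,)]                               # Model 1
two_way = list(itertools.combinations(range(sample.ndim), 2))   # Model 2

mu_model1 = ipf(sample, main_effects)
mu_model2 = ipf(sample, two_way)
risk_model1 = estimate_pu_given_su(sample, mu_model1, pi=0.5)
risk_model2 = estimate_pu_given_su(sample, mu_model2, pi=0.5)
```

For the main-effects model the fitted values equal the product of the one-way marginal totals divided by $n^{m-1}$, which gives a convenient check on the routine.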

4 Application 

In this section we evaluate the properties of the estimators of $\Pr(\mathrm{PU} \mid \mathrm{SU})$, $C_{11}$, $C_{21}$,
$C_{22}$, $C_{31}$, $C_{32}$ and $C_{33}$ empirically using an artificial finite population. As a basis
for this study, we construct an artificial population file by combining data for
two years (1996, 1997) from the U.K. General Household Survey, resulting in
records on $N = 33142$ individuals with no missing data. Following consideration
of possible intruder scenarios by [6], we use the following $m = 5$ key variables:

1. $X_1$: sex, in 2 categories.

2. $X_2$: marital status, in 7 categories.

3. $X_3$: economic status, in 13 categories.

4. $X_4$: socio-economic group, in 10 categories.

5. $X_5$: age in ten-year bands, in 8 categories.

There are $J = 2 \times 7 \times 13 \times 10 \times 8 = 14560$ possible key values defined by the
combinations of values of these key variables. We evaluated the estimates of
$\Pr(\mathrm{PU} \mid \mathrm{SU})$, $C_{11}$, $C_{21}$, $C_{22}$, $C_{31}$, $C_{32}$ and $C_{33}$ for simple random samples drawn from
this population at six sampling fractions ($\pi$ = 0.01, 0.02, 0.03, 0.05, 0.07 and 0.10).
We compute the values of $\hat{\mu}_j$ for each simple random sample using iterative
proportional fitting (see [3]) for the following two specifications of the model in (3):

- Model 1: a log-linear model including all main effects;

- Model 2: a log-linear model including also all two-factor interactions.

Tables 2 and 3 show the simulation results for the true and estimated values
of these quantities under Models 1 and 2, based on 100 replications.
The results in Table 2 for Model 1 show that this model overestimates
$\Pr(\mathrm{PU} \mid \mathrm{SU})$, $C_{11}$, $C_{21}$, $C_{22}$, $C_{31}$, $C_{32}$ and $C_{33}$; for example, the true
values of $\Pr(\mathrm{PU} \mid \mathrm{SU})$ are about 0.09 and 0.11 for $\pi$ = 0.03 and 0.05, while the
estimated values are 0.43 and 0.58. The bias is positive and increases as $\pi$ increases.
A possible explanation is as follows. Under the independence model the estimated
cell probability is positive for every cell whenever the univariate marginal
frequencies are all positive. This implies that the probability mass is spread out
over the contingency table. However, there exist correlations among the variables
that are not captured by the independence model. Therefore, under the independence model
the probabilities of empty cells tend to be underestimated and the probabilities
of nonempty cells tend to be overestimated. Since the probabilities entering $C_{11}$, $C_{21}$,
$C_{22}$, $C_{31}$, $C_{32}$ and $C_{33}$ are overestimated, the estimates of these quantities
are too large. Table 3 gives the results for Model 2 and shows that
this model gives better estimators of $\Pr(\mathrm{PU} \mid \mathrm{SU})$, $C_{11}$, $C_{21}$, $C_{22}$, $C_{31}$, $C_{32}$ and
$C_{33}$; for example, the true values of $\Pr(\mathrm{PU} \mid \mathrm{SU})$ are about 0.09 and 0.11 for
$\pi$ = 0.03 and 0.05, while the estimated values are 0.09 and 0.14. This suggests
that the fit improves when two-variable interactions are incorporated.

Table 2. Mean of the true value, the estimate and the bias of Pr(PU|SU), C11, C21, C22, C31, C32
and C33 from Model 1 using five variables with different sampling fractions ($\pi$); the
number of replications is 100.

                    $\pi$
Quantity    Row     0.01      0.02      0.03      0.05      0.07      0.10
Pr(PU|SU)   True    0.05443   0.07045   0.09474   0.11250   0.14240   0.1769
            Est.    0.2563    0.3513    0.43575   0.58541   0.7220    0.8744
            Bias    0.201     0.28      0.34      0.47      0.58      0.70
C11         True    7.36      14.16     23.24     34.28     48.9      71.5
            Est.    34.5      69.99     106.4     177.07    247.8     353.2
            Bias    27.14     55.83     83.16     142.79    198.9     281.7
C21         True    6.44      12.4      18.24     31.24     41.52     57.12
            Est.    45.6      91.51     139.8     228.9     313.8     434.6
            Bias    39.16     79.1      121.56    179.38    272.28    377.48
C22         True    0.040     0.120     0.240     0.680     1.760     3.04
            Est.    0.231     0.934     2.16      6.02      11.81     24.15
            Bias    0.19      0.814     1.92      5.34      10.05     21.11
C31         True    5.76      11.24     17.04     25.88     34.24     46.84
            Est.    43.8      86.21     130.3     209.7     281.3     378.05
            Bias    38.04     74.97     113.26    183.82    247.06    331.21
C32         True    0.004     0.240     0.560     1.960     2.48      5.20
            Est.    0.044     1.75      4.03      11.04     21.17     42.01
            Bias    0.04      1.51      3.47      9.08      18.69     36.81
C33         True    0         0         0         0.04      0.04      0.04
            Est.    0.001     0.012     0.041     0.193     0.531     1.55
            Bias    0.001     0.012     0.041     0.153     0.491     1.51



Table 3. Mean of the true value, the estimate and the bias of Pr(PU|SU), C11, C21, C22, C31, C32
and C33 from Model 2 using five variables with different sampling fractions ($\pi$); the
number of replications is 100.

                    $\pi$
Quantity    Row     0.01      0.02      0.03      0.05      0.07      0.10
Pr(PU|SU)   True    0.05443   0.07045   0.09474   0.11250   0.14240   0.1769
            Est.    0.01862   0.05798   0.09429   0.1460    0.18031   0.2231
            Bias    -0.0358   -0.0124   -0.0004   0.0335    0.03791   0.0462
C11         True    7.36      14.16     23.24     34.28     48.9      71.52
            Est.    2.53      11.61     23.09     44.27     63.21     91.23
            Bias    -4.83     -2.55     -0.15     9.99      14.31     19.71
C21         True    6.44      12.4      18.24     31.24     41.52     57.12
            Est.    2.78      11.62     21.71     39.23     53.63     78.51
            Bias    -3.66     -0.78     3.47      7.99      12.12     21.39
C22         True    0.040     0.120     0.240     0.680     1.76      3.04
            Est.    0.014     0.118     0.336     1.033     2.20      4.64
            Bias    -0.026    -0.002    0.096     0.353     0.44      1.6
C31         True    5.76      11.24     17.04     25.88     34.2      46.84
            Est.    2.84      10.92     19.55     33.79     43.8      58.23
            Bias    -2.92     -0.32     2.51      7.91      9.6       11.39
C32         True    0.040     0.240     0.560     1.96      2.48      5.2
            Est.    0.0288    0.223     0.605     1.77      3.46      7.1
            Bias    -0.0112   -0.017    0.045     -0.19     0.98      1.9
C33         True    0         0         0         0.04      0.04      0.04
            Est.    0         0         0         0.03      0.09      0.27
            Bias    0         0         0         -0.01     0.05      0.23



5 Conclusion

The distinction between sample uniqueness and population uniqueness is important
in the assessment of disclosure risk for categorical data. We have considered
how log-linear models can be useful not only for estimating the number of
uniques in the sample that are also unique in the population but also for giving more
information about the number of twin and triple records in the sample that
are also twins and triples in the population. This information could be useful in
deciding whether to release the data for public use.

From (2) we may be able to infer something from sample data about cells that are
population uniques but have sample counts of zero, and also about population
twins and sample uniques. This leads us to a very important question: “Is there
disclosure and harm to an individual if such an individual could be identified as
existing but is not in the sample?” This requires further investigation.

If a very complex log-linear model is chosen then the resulting
$\hat{E}[\sum_j I(F_j = s, f_j = r)]$ may either be unstable or not very informative. In
the extreme case, if a saturated model is employed, $\hat{\mu}_j = 1$ for all $j$ and the
$\hat{E}[\sum_j I(F_j = s, f_j = r)]$ fail to discriminate at all between the sample cases.
This suggests selecting a simpler log-linear model. The problem then is that, if
the model is too simple, it may fail to capture the variation between the $\mu_j$,
that is, there may be overdispersion. It is also of interest to study the robustness
to model choice and whether any improvements can be obtained through
the use of higher-order interactions and model selection techniques. Allowance
for overdispersion in $\hat{E}[\sum_j I(F_j = s, f_j = r)]$ and model choice require further
investigation.

Moreover, the results depend on several further strong assumptions: the set of key
variables available to the intruder is known, the sampling fraction is known, any error in
estimating the parameters $\beta$ of the log-linear model by $\hat{\beta}$ is ignored, and there is
no measurement error in the data.



Abstract. This paper describes a proposal to combine three approaches that will
enable research access to high quality data while preserving confidentiality.
These approaches are: developing synthetic microdata which can be accessed at
a restricted access site (a virtual Research Data Center), together with access to
the “gold standard” analytical data set through a Research Data Center network.
It also describes the promise of the development of other datasets - particularly
multiple public use files that can be created from the same underlying data and that
can be targeted at different audiences.



1 Introduction 

Data, and data access, lie at the heart of social science research. Billions of taxpayer
dollars are spent in supporting the collection and dissemination of federal, state and
local data, billions of dollars are spent in data analysis, and this, in turn, both informs
scientific understanding of core social science issues and guides decisions on how to
allocate billions of dollars in social programs. Although an entire analytical
infrastructure depends on the dissemination of high quality data, statistical agencies, which
have gone to great expense to collect such data, then deliberately destroy data quality
- often in ad hoc fashion - in order to protect respondent confidentiality. Indeed,
many statistical agencies spend millions of dollars, with concomitant respondent
burden, to collect microdata, only to suppress substantial numbers of the resulting tabular
output, and create tables with unknown statistical properties.

It is now apparent that new challenges threaten the ability of national statistical
institutes (NSIs) to release high quality public use data files (see Doyle et al., 2001).
Technological advances in computer capacity and matching technology, combined
with the explosion of online access to federal, state and local administrative records,
mean that NSIs must either severely degrade the quality of public use datafiles or
refuse to release them in order to protect respondent confidentiality (see Yancey et al.,
2003, and Domingo-Ferrer and Torra, 2003, for excellent reviews of matching
technology). This has very serious practical consequences.

The response to this threat by the statistical community has been to develop new
technical and non-technical approaches that will protect confidentiality but that will
also maintain the same quality of statistical analysis as was possible using old
techniques (see, for example, the work by Agrawal and Srikant (2000), which exemplifies
the work on privacy preserving data mining). The NSI community is also responding
to the issue - the Conference of European Statisticians recently established a working
group to recommend approaches to micro-data access.

One very promising technical approach has been to develop multiply-imputed syn- 
thetic micro-data (Rubin 1993). This has the advantage of completely protecting indi- 
vidual confidentiality, as well as providing users with access to data wherever they 
wish, but imposes substantial data producer costs and has been resisted by the user 
community because of data quality concerns. Another approach has been to develop 
restricted access sites, which permit researchers to work on-site with micro-data 
(Dunne, 2001). Yet a third approach has been to develop remote access procedures, 
which has the advantage of reducing researcher burden, but which involves substan- 
tial investments in hardware and software. In addition, there is likely to be consider- 
able bureaucratic resistance to adopting innovative techniques and algorithms to pro- 
tect data transmission (Blakemore, 2001) - and by the time that resistance has been 
overcome, the techniques may be obsolete. 

This paper describes a proposal to combine all three approaches: namely,
developing inference-valid synthetic microdata which can be accessed at a restricted access
site, together with access to the “gold standard” analytical data set through a Research
Data Center network at the U.S. Census Bureau¹. It also describes the promise of the
development of other datasets - particularly multiple public use files that can be
created from the same underlying data and that can be targeted at different audiences.



2 Background 

Fienberg (2003) summarized the technical goals of disclosure limitation techniques as
follows: (i) inferences should be the same as if we had original complete data; (ii)
researchers should have the ability to reverse the disclosure protection mechanism, not
for individual identification, but for inferences about parameters in statistical models;
(iii) there should be sufficient variables to allow for proper multivariate analyses; and
(iv) researchers should not only have the ability to assess goodness of fit of models
but also be provided with most summary information, such as residuals (to identify
outliers). The core guiding principle should be to generate released data that are as
close to the frontier as possible. These principles hold just as much for micro-data as
for synthetic data.

Most of these principles are obeyed with synthetic datafiles (see Muralidhar and
Sarathy (2002), and Abowd and Woodcock (2003) for reviews). While the
approaches vary (one approach is to shuffle data; another is to develop samples
composed of draws from the posterior predictive distribution of the confidential data,
given some conventionally disclosure-controlled data), a major advantage is that the
synthetic data contain exactly the same statistical information as the micro data,
which satisfies Fienberg’s first principle. In the second approach, while the synthetic
data implicates (described below) are not identical, the analyst can use the between-
implicate variation to measure the extent to which confidentiality protection made the
inferences less precise, which satisfies the second and fourth principles. The release
of sufficient variables, principle three, is discussed below.

¹ More detailed technical information is provided in a related paper, Abowd and Lane (2004),
an early version of which was presented at Statistics Sweden, August 2003.

But the use of synthetic data as a substitute for public use files produced using
conventional disclosure limitation techniques has not caught on with the user
community. A major problem has been the concern that the results produced from synthetic
data will not be the same as those from the “real” data. The only way to substantively
address this is to compare the results from synthetic data products with the results on
the “gold standard” confidential source file. This poses serious constraints for a
number of reasons. First, access to the “gold standard” file is, by definition, highly
restricted. Second, because there are typically many different possible uses of the
micro-data files, even if analysis on the synthetic data-files is “close” to what is
achieved using the “gold standard” files with one specification, researchers have
reasonable concern about whether analysis will be “close” using alternative specifications².

An obvious solution is to develop a two-part access protocol. The first part is to 
create a remote access site - a virtual Research Data Center (RDC) - which can pro- 
vide access to the full metadata repository of information, together with the synthetic 
data. Researchers can use such a site to gain familiarity with the dataset structure, 
develop code, and estimate analytical models. Because the data are synthetic, the 
statistical institute supplying the data to the remote site has to invest considerably less 
in protection technologies, which should dispel some of the concerns raised by 
Blakemore (2001). 

The second part is to provide access to the confidential micro-data at an RDC so 
that the models can be re-estimated on the “gold standard” file. The comparison of 
the two sets of estimates can be distributed as widely as possible - each analysis will 
provide an increment to the common body of knowledge as to what works and what 
doesn’t. The approach is described in the following sections. 



3 The New Approach 

3.1 The Value Added of Synthetic Data Approach 

One attractive feature of the synthetic data approach is that multiple public use files
can be created from the same underlying data - targeted at different audiences. For
example, a demographic dataset such as the Survey of Income and Program
Participation (discussed below) has at least two important user constituencies. One
constituency is interested in modeling the participation in welfare programs that are
state-specific, with state-specific qualification criteria - in which case geography is critical.
Another constituency is interested in modelling retirement decisions - in which case
date of birth is critical. Similarly, there are typically two users of business data. Some
users (such as transportation agencies) are particularly interested in geographic detail,
while others are interested in industry detail (such as industry analysts). Providing
both levels of detail on the same data set immediately re-identifies important
businesses. Jointly releasing both geography and date of birth, or geography and
industry, creates serious disclosure risk, and hence statistical agencies typically reduce the
quality of one or the other variable (or both) - reducing their utility to both sets of
users. However, synthetic data could be used to produce two separate data sets that
cannot be re-linked for such re-identification.

² Although this may be due to a lack of researcher familiarity with the disclosure limitation
approaches currently in practice - and the degree to which increasing protection has affected
data quality and inference reliability.

Another attractive feature of synthetic data is the ability to assess the biases in the 
protection system and the potential to correct public use products - since prior re- 
leases of synthetic data do not compromise proposed new releases. This aspect can be 
facilitated by means of the development of a restricted access data center and access 
to the “gold standard” files at the national statistical institute headquarters. 

There is, of course, some justifiable skepticism that synthesized data might hide 
important relations that a direct use of the confidential data would reveal. This is 
especially important if results are downwardly biased - since this may discourage 
further research. This makes the development of a feedback loop from the synthetic 
data to the confidential micro-data essential to develop confidence in these products 
and to ensure their continuous improvement - which is what is proposed here. 

3.2 The “Virtual” RDC 

A sensible approach for facilitating high quality research is to maintain the data in a
secure, restricted access environment, but widely distribute synthetic data through a
restricted access remote site. Because the simulated data can be used at less secure
sites than the statistical agency itself, researchers can develop an understanding of the
structure of the datasets and use simulated data to develop code and estimate basic
relationships before sending the code to an official secure site to estimate the
underlying relationships from the actual confidential data.

If multiple users can access the same dataset, and build on an existing database
infrastructure, there are numerous advantages. Results can be replicated or expanded -
which is a critical condition for scientific validity. Researchers can use existing
datasets to cut the analysis in different ways, with different foci, which develops a broader
understanding of the generalizability of results. In addition, the common use of
similar datasets builds a common body of knowledge, as has been the case with public use
files such as the Public Use MicroSample for the Decennial Census and the Current
Population Survey³.

The cornerstone of this dissemination system is the virtual RDC - a replica of the
research environment on the Census Bureau’s RDC network. This uses synthetic data
and the exact programming environment of the RDC network to permit researchers to
develop research proposals and to interact with key Census employees. The virtual
RDC can be used for primary research as confidence is built in the validity of the
synthetic data for analysis of particular types of programs. More importantly, it can be
used as an incubator for proposals to analyze the confidential data. Researchers can
directly benefit from the fact that the structure of the synthetic data and the structure
of the “gold standard” confidential data are identical. The researcher can develop
the proposal in the same environment as a real RDC, thus guaranteeing that the tools
needed to do the modeling are available and working properly.

³ Indeed, a very powerful case for this approach has been made by Soete and ter Weel (2003).

A well-used precursor to this model is the Cornell Restricted Access Data Center
(CRADC, part of CISER at Cornell). The CRADC, which was developed under an
NSF Social Data Infrastructure grant, as well as a National Institute on Aging grant, is
the model for the virtual RDC. Authorized users access data from authorized
providers using a “window” on the CRADC machines (which appear to be ordinary
Windows computers to the user). The CRADC provides a complete research and
reporting environment that fully supports collaboration among authorized users of the same
data⁴. Although the CRADC is a reasonable model for a virtual RDC, the virtual RDC
goes farther. Real RDCs operate with “thin client” interfaces to the RDC computing
network, a specialized Linux environment. The virtual RDC will provide an exact
replica of the supercluster computing system that we will implement to create the
synthetic data and support the complex modeling on the “gold standard” and synthetic
data.

The Census Bureau has agreed to support an advisory panel of ten experts and us- 
ers. Their role will be to provide regular (three times/year) feedback on the choice of 
data files to be synthesized and the quality of the data synthesizers. 



3.3 Research Data Centers 

An important component of developing a new confidentiality protection system is to 
develop a research data center (RDC) network in which the quality of the new data 
product can be tested. The more sites that are available and accessible, the greater 
will be the ability of the scientific community to build the core common body of 
knowledge necessary for the acceptance and use of the new data product. 

The existence of such a network is, of course, critical whether or not synthetic data 
approaches are adopted. An important consequence of the increasing threat of re- 
identification is that more and more noise is being added to public use datasets - with 
analytical consequences that would be unknown without access to the underlying 
confidential data. Since noise addition biases coefficients towards zero, researchers 
might, for example, incorrectly conclude that earnings differentials by race and sex 
had vanished over time - rather than realizing that more noise had been added over 
time! 
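
The attenuation effect mentioned above can be illustrated with a small simulation. The sketch below is not drawn from any Census Bureau procedure: it assumes, purely for illustration, that independent normal noise is added to a single explanatory variable in a simple linear regression with true slope 2 (classical measurement error), and that NumPy is available. The function name `ols_slope` is hypothetical.

```python
import numpy as np


def ols_slope(x, y):
    """Ordinary least squares slope of y on x (with an intercept)."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))


rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(0.0, 1.0, n)               # confidential explanatory variable
y = 2.0 * x + rng.normal(0.0, 1.0, n)     # outcome with true slope 2

# Re-estimate the slope after adding noise of increasing standard deviation to x.
slopes = {
    noise_sd: ols_slope(x + rng.normal(0.0, noise_sd, n), y)
    for noise_sd in (0.0, 0.5, 1.0, 2.0)
}
# Under these assumptions the slope shrinks towards 2 / (1 + noise_sd ** 2).
```

An analyst who sees only the noisy file cannot tell whether a smaller coefficient reflects a real change or simply more noise, which is why access to the confidential data matters for validation.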

The basic structure of the RDC network in the United States is well known, and 
described in both Dunne (2001) and on the Center for Economic Studies website 
(www.ces.census.gov). Briefly, RDC’s enable external researchers to access micro- 
data under strict security protocols. All researchers must become Special Sworn 
Status employees of the Census Bureau (which involves fingerprinting, an FBI check, 
and an oath to protect the confidentiality of respondents - which, if broken, subjects 
the researcher to the penalty of a $250,000 fine and/or 5 years in jail). The researcher 
must document which files will be accessed, which variables used, and for which 
period of time. The researcher must also demonstrate that the predominant purpose of 
the research is to improve Census Bureau censuses, surveys and inter-censal population 
estimates, and provide a post-project certification that this has been achieved (see 
Greenia, 2004). 

Technically, all of the Census products on the CRADC are “public use” files; that is, they 
have been approved by the Census Disclosure Review Board for general distribution. 



4 Application 

Staff at the Longitudinal Employer-Household Dynamics (LEHD) program at the 
Census Bureau have made substantial advances in the development of a public use file 
containing data from the Survey of Income and Program Participation (SIPP, 1990-1996 
Panels) and Social Security administrative/tax data^. The confidentiality protection of 
this public use file is particularly challenging because the SIPP source records must not 
be re-identifiable in the existing SIPP public use files; that is, this new public use file 
must be used independently of the existing SIPP public use files. 

The development of the SIPP-SSA public use file has provided much needed ex- 
perience in developing the layers of the confidentiality program. Since this public use 
file is targeted at retirement and disability research for national programs, all geogra- 
phy has been removed from the public use portion. Of course, the geography is still 
present on the internal files, so RDC access can be provided for those variables. Re- 
moval of the geography was necessary to limit the potential for re-identifying SIPP 
source records in the existing SIPP public use files. Preserving marital relations as 
well as basic demography and education variables provided the maximum extent to 
which conventional identity disclosure control methods could be used. The inter- 
agency committee thought that linking a handful of extremely coarse demographic 
and educational variables from the SIPP to the massive amounts of administrative and 
tax data was not the most effective method of providing access to these data. 

As an alternative, a layered approach was adopted. Successive, confidential ver- 
sions of the linked data including a long list of proposed variables from the SIPP and 
all of the administrative variables from SSA (including the tax data) were developed. 
Researchers at Census, SSA, IRS, and CBO are studying the variables in these files, 
deemed “gold standard” files because they contain all of the original confidential data. 
Once the research teams are satisfied that the gold standard files adequately provide 
for the study of statistical models relating the variables of interest from the SIPP and 
the administrative data, a variety of masked potential public use files will be pro- 
duced. The same research teams will then assess the bias and loss of precision from 
the various techniques. Other research teams will assess the identity and attribute 
disclosure risks from each of the methods. The committee will then be equipped with 
reasonable quantitative measures of the disclosure risks, scientific biases, and losses 
of precision associated with feasible implementations of these new confidentiality 
protection techniques. It is expected that a public use product will be available within 
two years. Interim products include full RDC support for the gold standard files, 
which contain links that permit the RDC use of any variable in the existing public use 
SIPPs. 



^ This effort is undertaken with support and advice from an interagency committee that in- 
cludes the Social Security Administration, Internal Revenue Service, Congressional Budget 
Office and other parts of the Census Bureau. 
We stress that the SIPP-SSA public use file is not a static product. We fully expect 
the interaction of RDC-based researchers with the data to provide much needed feed- 
back to the process of variable selection and confidentiality protection for such files. 
We further expect that we will eventually develop a full virtual/synthetic data public 
use file for this application. Such a set of files can be iteratively refined, re-released, 
and specialized to different research audiences using the techniques described in the 
synthetic data section above. Successful implementation of this layered confidential- 
ity protection system would provide much wider scientific access to the integrated 
SIPP-SSA data and much higher quality use of the underlying confidential data. 

The broader access is due to the provision of inference-valid virtual/synthetic data 
from a data synthesizer that has been thoroughly tested by the data provider. The 
higher quality use of the underlying confidential data results from the use of the vir- 
tual/synthetic data to refine models and develop hypotheses as a part of the applica- 
tion process for access to the RDC-based confidential data. Thus, by providing direct 
feedback to the data providers (Census, SSA and IRS) the RDC users of the confiden- 
tial micro data identify the strengths and weaknesses of the public use product. The 
providers of the data then incorporate this information into future versions of the 
public use product. 



5 Summary 

The continued distribution of public use data-files is clearly threatened by the in- 
creased re-identification risk associated with both technological advances in linking 
software and the widespread availability of administrative records. It is clear that new 
approaches to developing public use data files must be investigated. This paper sug- 
gests the adoption of a three-tiered approach that combines both technical and non- 
technical approaches. The technical approach - the creation of synthetic datasets - 
could, in principle, permit the creation of multiple public use datasets from a single 
underlying confidential file that could be customized for multiple different constitu- 
encies. The non-technical approach is to combine the use of an already well accepted 
RDC network with that of a “Virtual” RDC to both reduce the access costs and de- 
velop a common body of knowledge about the quality of the results generated from 
the analysis of synthetic data files relative to that from confidential micro-data. The 
creation of a feedback loop between users of the data - RDC researchers - and data 
producers is a particularly attractive component of this approach. While the initial 
results have been quite promising, more extensive research is ongoing. 

Abstract. This paper describes ongoing research to protect confidentiality 
in longitudinal linked data through creation of multiply-imputed, 
partially synthetic data. We present two enhancements to the methods 
of [2] . The first is designed to preserve marginal distributions in the par- 
tially synthetic data. The second is designed to protect confidential links 
between sampling frames. 



1 Introduction 

Statistical agencies are frequently confronted with the competing objectives of 
providing high-quality data to researchers and protecting the confidentiality of 
survey respondents. Numerous methods have been developed to protect confi- 
dential data without undue distortion to underlying relationships among vari- 
ables. Commonly used methods include cell suppression, data masking, and data 
swapping (see e.g., [16] or the appendix to [2]). In general, the extent to which 
these methods succeed in protecting confidentiality and preserving the analyst’s 
ability to obtain valid statistical inferences depends on the nature of the under- 
lying data. Furthermore, downstream statistical analyses may require detailed 
knowledge of the disclosure control techniques or specialized software. 

An alternate approach is to develop multiple synthetic data sets for public 
release. This approach stems from the related proposals [15] and [3]. [15] suggests 
generating synthetic data through multiple imputation^1; [3] suggests generating 
synthetic data by bootstrap methods^2. A decided advantage of the synthetic 
data approach is that valid inferences can be obtained using standard software 
and methods^3. Furthermore, since the released data are synthetic, i.e., contain 
no data on actual units, they pose no disclosure risk. 

In practice, generating plausible synthetic values for all variables in a database 
may be difficult. This has led several authors to consider the creation 
of multiply-imputed, partially-synthetic data sets that contain a mix of actual 
and imputed values. In partially synthetic data, confidential data are multiply-imputed, 
and disclosable data are released without perturbation. [6] pioneered 
this approach in the Survey of Consumer Finances. [2] adopt this approach to 
protect confidentiality in longitudinal linked data. [10] develops methods for 
valid inference, and [13] presents a nonparametric method to generate multiply-imputed, 
partially-synthetic data. 

^1 This proposal is developed more fully in [8]. [9] provides a simulation study, [12] 
discusses inference, and [11] provides an application. 

^2 [5] apply this method to categorical data; [4] use related concepts to develop a 
measure of disclosure risk. 

^3 In the case of multiply-imputed synthetic data, these methods are related to those 
applied to the analysis of multiply-imputed missing data, e.g., [14]. See [8] for details. 

We consider the case of longitudinal linked data. These are defined as mi- 
crodata that contain observations from two or more related sampling frames, 
with measurements for multiple time periods for all units of observation. They 
can be survey or administrative data, or some combination thereof. Our proto- 
typical example is longitudinal data on employers and employees. Employment 
relationships define the links between them. We are primarily interested in the 
problem of protecting confidentiality when data from all three sampling frames 
(employers, employees, and employment histories) are combined for statistical 
analysis, and when the links between sampling frames (a history of employ- 
ment relationships) are deemed confidential. In [2] we considered the case where 
the links between sampling frames were disclosable. In this paper we discuss 
multiply-imputing confidential links. We also present an improved method for 
multiply-imputing confidential characteristics of the sampled units. We apply 
a nonparametric transformation to each continuous confidential variable to im- 
prove the fit of the imputation model, and to better preserve marginal distribu- 
tions in the partially synthetic data. 

Longitudinal linked data present particular challenges for statistical disclo- 
sure limitation. Like all longitudinal data, they are characterized by complicated 
dynamic relationships between variables. However when data from multiple re- 
lated sampling frames are combined, these dynamic relationships span multiple 
frames. Furthermore, these data are generally composed of a mix of discrete and 
continuous variables, some with censored or truncated distributions. Finally, the 
links between sampling frames may themselves be deemed confidential. Protect- 
ing their confidentiality requires new methods. 

The remainder of the paper is organized as follows. Section 2 introduces nota- 
tion and discusses the [2] method for multiply-imputing confidential characteris- 
tics of units of observation when links between sampling frames are disclosable. 
Section 3 presents an improvement over these methods that better preserves 
marginal distributions. Section 4 considers the case where links between frames 
are confidential, and Sect. 5 concludes. Simulations and empirical results are 
forthcoming. 

2 Concepts 

2.1 Multiply-Imputed Partially Synthetic Data 

Consider a database with confidential elements $Y$ and disclosable elements $X$.^4 
Both $Y$ and $X$ may contain missing data. Using standard notation from the 
missing data literature, let the subscript $mis$ denote missing data and the subscript 
$obs$ denote observed data, so that $Y = (Y_{mis}, Y_{obs})$ and $X = (X_{mis}, X_{obs})$. 
We assume throughout that the missing data mechanism is ignorable. 

^4 The database in question is defined quite generally, and the discussion in this section 
is not necessarily limited to longitudinal linked data. 

The database is represented by the joint density $p(Y, X, \theta)$, where $\theta$ are 
unknown parameters. [2] suggest imputing confidential data items with draws $\tilde{Y}$ 
from the posterior predictive density 

$$p(\tilde{Y} \mid Y_{obs}, X_{obs}) = \int p(\tilde{Y} \mid X_{obs}, \theta) \, p(\theta \mid Y_{obs}, X_{obs}) \, d\theta . \tag{1}$$

The process is repeated $M$ times, resulting in $M$ multiply-imputed partially 
synthetic data files $\tilde{Y}^{m}$, $m = 1, \ldots, M$. In practice, it may be easier to first 
complete the missing data using standard multiple-imputation methods and then 
generate the masked data as draws from the posterior predictive distribution of 
the confidential data given the completed data. For example, first generate $M$ 
imputations of the missing data $(Y_{mis}^{m}, X_{mis}^{m})$, where each implicate $m$ is a draw 
from the posterior predictive density 

$$p(Y_{mis}, X_{mis} \mid Y_{obs}, X_{obs}) = \int p(Y_{mis}, X_{mis} \mid Y_{obs}, X_{obs}, \theta) \, p(\theta \mid Y_{obs}, X_{obs}) \, d\theta . \tag{2}$$

With completed data $Y^{m} = (Y_{mis}^{m}, Y_{obs})$ and $X^{m} = (X_{mis}^{m}, X_{obs})$ in hand, draw 
the partially synthetic implicate $\tilde{Y}^{m}$ from the posterior predictive density 

$$p(\tilde{Y} \mid Y^{m}, X^{m}) = \int p(\tilde{Y} \mid X^{m}, \theta) \, p(\theta \mid Y^{m}, X^{m}) \, d\theta \tag{3}$$

for each imputation $m$. 

In practice, it can be very difficult to specify the joint probability distribution 
of all data, as in (1), (2), and (3). Instead, [2] approximate the joint densities 
using a sequence of conditional densities defined by generalized linear models. 
Doing so provides a way to model complex interdependencies between variables 
that is both computationally and analytically tractable. One can accommodate 
both continuous and categorical data by choice of an appropriate generalized 
linear model. The multiply-imputed partially synthetic data are drawn variable- 
by-variable from the posterior predictive distribution defined by an appropriate 
generalized linear model under an uninformative prior. If we let $y_k$ denote a 
single variable among the confidential elements of our database, imputed values 
$\tilde{y}_k$ are drawn from 

$$p(\tilde{y}_k \mid Y^{m}, X^{m}) = \int p(\tilde{y}_k \mid Y_{\sim k}^{m}, X^{m}, \theta_k) \, p(\theta_k \mid Y^{m}, X^{m}) \, d\theta_k \tag{4}$$

where $Y_{\sim k}^{m}$ are completed data on confidential variables other than $y_k$. 
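
As an illustration of a single variable-by-variable draw, the sketch below treats the simplest case: a normal linear regression of one confidential variable on one disclosable variable. It assumes the uninformative prior $p(\beta, \sigma^2) \propto 1/\sigma^2$, for which the posterior has a standard closed form. The function names (`fit_ols`, `draw_synthetic`), the simulated data, and the choice of five implicates are hypothetical and are not taken from [2].

```python
import math
import random

def fit_ols(x, y):
    """OLS of y on (1, x): returns beta_hat, (X'X)^{-1} and the residual sum of squares."""
    n = len(x)
    sx, sxx = sum(x), sum(a * a for a in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]   # (X'X)^{-1}
    b0 = inv[0][0] * sy + inv[0][1] * sxy
    b1 = inv[1][0] * sy + inv[1][1] * sxy
    rss = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    return (b0, b1), inv, rss

def draw_synthetic(x, y, rng):
    """One implicate: draw the parameters from their posterior, then y-tilde given x."""
    n, k = len(x), 2
    (b0, b1), inv, rss = fit_ols(x, y)
    # sigma^2 | data is rss / chi-square(n - k) under the prior proportional to 1/sigma^2
    sigma2 = rss / rng.gammavariate((n - k) / 2.0, 2.0)
    s = math.sqrt(sigma2)
    # beta | sigma^2, data ~ N(beta_hat, sigma^2 (X'X)^{-1}), drawn via a 2x2 Cholesky factor
    l00 = math.sqrt(inv[0][0])
    l10 = inv[1][0] / l00
    l11 = math.sqrt(inv[1][1] - l10 * l10)
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    d0 = b0 + s * l00 * z0
    d1 = b1 + s * (l10 * z0 + l11 * z1)
    # posterior predictive draw for every record
    return [d0 + d1 * a + rng.gauss(0.0, s) for a in x]

rng = random.Random(1)
x_obs = [rng.uniform(0.0, 10.0) for _ in range(500)]          # disclosable variable
y_obs = [1.0 + 0.5 * a + rng.gauss(0.0, 1.0) for a in x_obs]  # confidential variable
implicates = [draw_synthetic(x_obs, y_obs, rng) for _ in range(5)]  # M = 5 implicates
```

Because the parameters are redrawn for each implicate, the variation across implicates reflects parameter uncertainty as well as residual noise, which is what distinguishes a posterior predictive draw from simply adding noise to fitted values.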

2.2 Longitudinal Linked Data 

It is convenient to represent a longitudinal linked database as a collection $\mathcal{T}$ of 
files. Each file $F \in \mathcal{T}$ contains longitudinal data from a single sampling frame. 
Each file may contain both confidential and disclosable data elements. Observations 
in different files are linked by a series of identifiers. An example serves to 
illustrate the structure of a longitudinal linked database. 

Our prototypical longitudinal linked database contains observations about 
individuals and their employers, linked by means of a work history. The work 
history contains data on each job held by an individual, including the identity 
of the employer. Suppose we have linked data on $I$ employees and $J$ employers 
spanning $T$ periods. There are three data files. The first file $U \in \mathcal{T}$ contains 
longitudinal data on employees, with elements denoted $u_{it}$ for $i = 1, \ldots, I$ and 
$t = 1, \ldots, T_i$. The second data file $Z \in \mathcal{T}$ contains longitudinal data on employers, 
with elements $z_{jt}$ for $j = 1, \ldots, J$ and $t = 1, \ldots, T_j$. The third data file $W \in \mathcal{T}$ 
contains work histories, with elements $w_{ijt}$. The data files $U$ and $W$ are linked 
by a person identifier. The data files $Z$ and $W$ are linked by a firm identifier, 
conceptualized by the link function $j = J(i,t)$ that indicates the firm $j$ at which 
worker $i$ was employed at date $t$. For simplicity, assume that all work histories 
in $W$ can be linked to individuals in $U$ and firms in $Z$ and that the employer 
link is unique for each $(i,t)$.^5 

As discussed at length in [2], it is desirable to condition the imputation 
equations on all available data. In the context of longitudinal linked data, this 
includes data from all sampling frames. Thus when imputing variable $y_k$ in file 
$F \in \mathcal{T}$, conditioning information should include not only data elements in $F$, but 
also data from other files $F' \in \mathcal{T}$. This helps to preserve relationships among 
variables in the various files. Inevitably, some data reduction is required. We 
conceptualize these data reductions by functions $g$ of data in files $F' \in \mathcal{T}$. 

It is frequently desirable to estimate separate imputation equations on subsets 
of the data, e.g., separate models for men and women, full-time and part-time 
workers, et cetera. We conceptualize these subsets as data configurations, 
indexed by $c$. A given configuration may also reflect the structure of available 
data. For example, to impute earnings in some period $t$, we may wish to condition 
on past and future values of earnings at that employer. Such data may 
not be available for every observation because of “structural” aspects of the 
employment history, e.g., the worker was not employed in the previous period. 

Let $p_k^c(\cdot \mid \cdot, \theta_k^c)$ represent the likelihood of an appropriate generalized linear 
model for configuration $c$ of variable $y_k \in F$. Under an uninformative prior, 
imputations are drawn from the posterior predictive density 

$$p(\tilde{y}_k \mid Y^{m}, X^{m}) = \int p_k^c\big(\tilde{y}_k \mid Y_{\sim k}^{m} \in F,\; X^{m} \in F,\; g_k^c(Y^{m} \in F', X^{m} \in F'),\; \theta_k^c\big) \, p(\theta_k^c \mid Y^{m}, X^{m}) \, d\theta_k^c \tag{5}$$

where $Y_{\sim k}^{m}$ represents other confidential data in $F$, and $F'$ denotes the complement 
of $F$ in $\mathcal{T}$. Note $Y_{\sim k}^{m}$ may include measurements on $y_k$ taken in other time 
periods. 

^5 The notation to indicate a one-to-one relation between work histories and individuals 
when there are multiple employers is cumbersome. Our application properly handles 
the case of multiple employers for a given individual during a particular sample 
period. 
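
The three-file structure and the link function J(i,t) described in this section can be pictured with a small data-structure sketch. Everything here is an illustrative assumption: the class name `LinkedDatabase`, the dictionary layout, and the toy records are hypothetical and do not describe the actual LEHD file formats.

```python
from dataclasses import dataclass, field

@dataclass
class LinkedDatabase:
    employees: dict = field(default_factory=dict)   # file U: (i, t) -> employee record
    employers: dict = field(default_factory=dict)   # file Z: (j, t) -> employer record
    histories: dict = field(default_factory=dict)   # file W: (i, j, t) -> job record

    def link(self, i, t):
        """Link function J(i, t): the employer j of worker i in period t, or None."""
        matches = [j for (wi, j, wt) in self.histories if wi == i and wt == t]
        if len(matches) > 1:
            raise ValueError("employer link must be unique for each (i, t)")
        return matches[0] if matches else None

    def conditioning_data(self, i, t):
        """Gather data from all three files for one worker-period."""
        j = self.link(i, t)
        return {
            "employee": self.employees.get((i, t)),
            "employer": self.employers.get((j, t)),
            "job": self.histories.get((i, j, t)),
        }

db = LinkedDatabase(
    employees={(1, 1): {"age": 40}, (1, 2): {"age": 41}},
    employers={(10, 1): {"employment": 50}, (10, 2): {"employment": 55}},
    histories={(1, 10, 1): {"earnings": 900.0}, (1, 10, 2): {"earnings": 950.0}},
)
record = db.conditioning_data(1, 2)
```

The `conditioning_data` helper mirrors the idea that an imputation equation for a variable in one file should be able to draw on data from the other files through the links.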

3 Preserving Marginal Distributions 
in the Partially Synthetic Data 

[2] discuss several enhancements to the above methods that improve the confi- 
dentiality protection or the analytic usefulness of the partially synthetic data. 
We present an additional one here, that helps preserve the marginal distributions 
of confidential variables. 

Under the variable-by-variable imputation method described above, an appropriate 
generalized linear model defines a parametric distribution for the variable 
$y_k$ under imputation, conditional on confidential and non-confidential data 
in all files. In many cases, the marginal distribution of $y_k$ is unknown or differs 
from the parametric family of the posterior predictive distribution of the imputation 
model. This is problematic for generating multiply imputed, partially 
synthetic data using generalized linear models, since it can lead to discrepancies 
between the moments of the confidential data and the partially synthetic 
data. [2] found that their method preserved first and second moments of the 
confidential data. However, higher moments may be distorted if the posterior 
predictive distribution of the generalized linear model differs from the marginal 
distribution of $y_k$. 

The usual solution, of course, is to take some analytic transformation of $y_k$. 
For example, it is frequently argued that the earnings of white males have an 
approximately lognormal distribution. Thus a suitable imputation model might 
be a normal linear regression of the logarithm of earnings on other data items. 
The imputed values are normally distributed. Exponentiation returns them 
(approximately) to the original location and scale. 

There are two important limitations to such a strategy. First, any error in the 
analytic transformation biases the distribution of the imputed values^6. Second, 
no convenient analytic transformation may be available. We suggest a nonparametric 
transformation that addresses these limitations. 

^6 Error in the transformation is any difference between the distribution of the 
transformed variable and the posterior predictive distribution of the generalized linear 
model. 

Our transformation is conceptually simple, and is applicable to continuous 
variables in a variety of contexts. We consider the case where the imputation 
model is a normal linear regression, though other applications are possible. Under 
an uninformative or conjugate prior, the posterior predictive distribution is 
normal. If the marginal distribution of the confidential variable $y_k$ differs greatly 
from normality, the distribution of the imputed values will differ from that of 
the confidential data. The idea is to transform the confidential data so that they 
have an approximately normal distribution, estimate the imputation model on 
the transformed data, and perform the inverse transformation on the imputed 
values. The first step is to obtain an estimate of the marginal distribution of 
$y_k$. Since we are in the case where the exact parametric distribution of $y_k$ is 
unknown, we suggest a nonparametric estimate, e.g. a kernel density estimate, 
whose cumulative distribution function we denote $\hat{K}$. Provided sufficient data, 
this can be done for each data configuration $c$. For each observation $y_k$, compute 
the transformed value $y_k' = \Phi^{-1}(\hat{K}(y_k))$, where $\Phi$ denotes the standard normal 
CDF. By construction, the $y_k'$ have a standard normal distribution. Then estimate 
the imputation regression on $y_k'$, and draw imputed values $\tilde{y}_k'$ from the posterior 
predictive distribution. The imputed values are normally distributed with conditional 
mean and variance defined by the regression model. To return the imputed values to 
the original location and scale, compute the inverse transformation 
$\tilde{y}_k = \hat{K}^{-1}(\Phi(\tilde{y}_k'))$. The imputed values $\tilde{y}_k$ are distributed according 
to $\hat{K}$, preserving the marginal distribution of the confidential data. 
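
The transformation and its inverse can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian kernel with a rule-of-thumb bandwidth, inverts the estimated distribution function by bisection, and uses standard normal draws as a stand-in for draws from the imputation regression. All function names and the simulated earnings variable are hypothetical.

```python
import random
from statistics import NormalDist, stdev

STD_NORMAL = NormalDist()

def make_kernel_cdf(sample, bandwidth):
    """Distribution function implied by a Gaussian kernel density estimate of `sample`."""
    n = len(sample)
    def cdf(y):
        return sum(STD_NORMAL.cdf((y - s) / bandwidth) for s in sample) / n
    return cdf

def invert_cdf(cdf, p, lo, hi, tol=1e-6):
    """Solve cdf(y) = p by bisection on [lo, hi]; cdf must be increasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def to_normal_scale(y, cdf):
    """Forward transformation: inverse normal CDF of the estimated distribution function."""
    return STD_NORMAL.inv_cdf(cdf(y))

def from_normal_scale(z, cdf, lo, hi):
    """Inverse transformation back to the original location and scale."""
    return invert_cdf(cdf, STD_NORMAL.cdf(z), lo, hi)

rng = random.Random(2)
earnings = [rng.lognormvariate(10.0, 0.8) for _ in range(200)]   # skewed confidential variable
h = 1.06 * stdev(earnings) * len(earnings) ** -0.2               # rule-of-thumb bandwidth (assumption)
K = make_kernel_cdf(earnings, h)
lo, hi = min(earnings) - 10 * h, max(earnings) + 10 * h          # search interval for the inverse

z = [to_normal_scale(y, K) for y in earnings]                    # values on the normal scale
z_tilde = [rng.gauss(0.0, 1.0) for _ in range(50)]               # stand-in for regression draws
y_tilde = [from_normal_scale(v, K, lo, hi) for v in z_tilde]     # imputed values on the original scale
round_trip = from_normal_scale(z[0], K, lo, hi)
```

In an actual application the values in `z_tilde` would come from the posterior predictive distribution of the regression estimated on `z`, and the bandwidth would be chosen with more care, since an oversmoothed estimate makes the transformed values only roughly normal.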

There is one caveat to the above discussion. The transformation and its 
inverse depend on the data. That is, the transformation function depends on 
an estimate $\hat{K}(y_k)$ of the marginal distribution function, and hence contains model 
uncertainty (sampling error). To obtain valid downstream inference, we need to 
account for the additional uncertainty introduced by the transformation. A simple 
way to do this is to bootstrap the transformation. We therefore suggest an 
additional step. In each implicate $m$, draw a Bayesian bootstrap sample of values 
of $y_k$, denoted $y_k^{m}$, and compute the transformation 
$y_k^{m\prime} = \Phi^{-1}(\hat{K}^{m}(y_k^{m}))$, where $\Phi$ is the standard normal CDF and $\hat{K}^{m}$ 
is the distribution estimate computed from the bootstrap sample. After drawing 
imputed values $\tilde{y}_k^{m\prime}$ from the appropriate posterior predictive distribution, 
perform the inverse transformation 

$$\tilde{y}_k^{m} = (\hat{K}^{m})^{-1}\big(\Phi(\tilde{y}_k^{m\prime})\big) .$$

Simulation results and an empirical application of this method are forthcom- 
ing. 
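
The bootstrap step in this section can be sketched as follows. One common way to realize a Bayesian bootstrap is to draw Dirichlet(1, ..., 1) weights over the observed values and resample with those weights; whether this matches the authors' implementation is an assumption. The function names and the toy values are hypothetical.

```python
import random

def bayesian_bootstrap_sample(values, rng):
    """Resample `values` using Dirichlet(1, ..., 1) weights over the observations."""
    gaps = [rng.expovariate(1.0) for _ in values]   # normalized exponentials are Dirichlet(1, ..., 1)
    total = sum(gaps)
    weights = [g / total for g in gaps]
    return rng.choices(values, weights=weights, k=len(values))

def bootstrap_samples(values, n_implicates, seed=0):
    """One Bayesian bootstrap sample per implicate m = 1, ..., M."""
    rng = random.Random(seed)
    return [bayesian_bootstrap_sample(values, rng) for _ in range(n_implicates)]

observed = [float(v) for v in range(1, 21)]          # toy confidential values
samples = bootstrap_samples(observed, n_implicates=3)
```

Each element of `samples` would then be used to re-estimate the marginal distribution for its implicate, so that the uncertainty in the transformation carries through to the released data.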

4 Protecting Confidential Links 
between Sampling Frames 

[2] considered the case where links between data files $F \in \mathcal{T}$ were among disclosable 
data elements $X$. In many situations this is unlikely to be the case. 
Returning to our prototypical longitudinal linked database, links between files 
constitute a history of employment relationships. From these one can compute 
a variety of statistics (e.g., the number of jobs held by an individual in each 
period, firm employment in each period, etc.) that can be used to identify employers 
and employees in the partially synthetic data. Thus we now consider the 
case where links between data files are deemed confidential. Our suggestion is to 
treat these like other confidential data items, and multiply-impute them under 
appropriate generalized linear models. In the context of our prototypical longitudinal 
linked database on employers and employees, this amounts to imputing 
the link function $j = J(i,t)$. For a given worker $i$, this can be accomplished 
either by imputing the firm’s identity $j$ in some period $t$; imputing the dates $t$ 
associated with an employment spell at firm $j$; or both. We illustrate the method 
with an example taken from current research at the U.S. Census Bureau. 

An application of the procedure described in this paper is currently underway 
at the U.S. Census Bureau, using data from the Longitudinal Employer-Household 
Dynamics (LEHD) program. These are administrative data built from 
quarterly unemployment insurance (UI) system wage reports. They cover the 
universe of employment at businesses required to file quarterly UI reports - estimated 
to comprise more than 96 percent of total wage and salary civilian jobs in 
participating states. See [1] or [7] for a detailed description of the data. The file 
structure corresponds to that of the prototypical database described in Sect. 2.2. 
To protect confidentiality of the employment history, our objective is to ensure 
that any person- or firm-level summary of the history is perturbed. To do so, 
it is sufficient to multiply-impute the identity of at least one employer in each 
individual’s employment history, and the start and end dates of all employment 
spells. 

To multiply-impute (at least) one employer’s identity in the individual’s employment 
history, we use a logistic regression model that conditions on establishment 
employment and detailed employer and employee geography. The set 
of candidate “donor” establishments is restricted to businesses operating in the 
same county and detailed industry, in some cases stratified by employment. Denote 
the set of such firms by $\Omega_{it}$. Conditioning variables for the regression 
include establishment employment, characteristics of the within-establishment 
wage distribution, and the worker’s physical proximity to the business. Denote 
the vector of these characteristics by $x_{ijt}$. The imputation model is based on 

$$\Pr\big(J(i,t) = j \mid j \in \Omega_{it}\big) = \frac{\exp\{\alpha_{jt} + x_{ijt}'\beta\}}{\sum_{k \in \Omega_{it}} \exp\{\alpha_{kt} + x_{ikt}'\beta\}} \tag{6}$$

where $\alpha_{jt}$ is a firm and time specific effect. 
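
A logit model over a restricted donor set assigns each candidate firm a probability proportional to the exponential of its linear index, and an imputed employer is then drawn from those probabilities. The sketch below illustrates this with fixed, hypothetical coefficients; in a multiple-imputation setting the coefficients would themselves be drawn from their posterior distribution. The firm labels, covariate values, and function names are all assumptions made for illustration.

```python
import math
import random

def link_probabilities(candidates, beta):
    """candidates: {firm_id: (firm_time_effect, covariate_vector)} -> {firm_id: probability}."""
    scores = {j: a + sum(b * v for b, v in zip(beta, x)) for j, (a, x) in candidates.items()}
    top = max(scores.values())                        # subtract the maximum for numerical stability
    expd = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(expd.values())
    return {j: e / total for j, e in expd.items()}

def impute_employer(candidates, beta, rng):
    """Draw one employer identity from the donor set according to the logit probabilities."""
    probs = link_probabilities(candidates, beta)
    firms = list(probs)
    return rng.choices(firms, weights=[probs[j] for j in firms], k=1)[0]

# Hypothetical donor set for one worker-period: (firm and time effect, covariates)
candidates = {
    "firm_a": (0.2, [1.5, 0.3]),
    "firm_b": (0.0, [0.5, 0.9]),
    "firm_c": (-0.1, [2.0, 0.1]),
}
beta = [0.8, -1.2]                                    # hypothetical coefficients
probs = link_probabilities(candidates, beta)
imputed_firm = impute_employer(candidates, beta, random.Random(3))
```

Restricting `candidates` to firms in the same county and industry, as the text describes, keeps the imputed link plausible while still perturbing the employer identity.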

We also multiply-impute the start and end date of each employment spell. We 
can represent the employment history of an individual at a particular employer 
by a binary string. Each digit of the string corresponds to one quarter in the 
sample period. It takes value 1 if the worker was employed at that business in that 
quarter, and 0 otherwise. The imputation model is a binary logit for employment 
in a given quarter, conditional on characteristics from all sampling frames and 
whether the individual was employed at that business in the four previous and 
subsequent quarters. We multiply-impute an individual’s employment status at 
the business for each quarter in a window around the employment spell’s start 
and end date. This perturbs the start and end dates of the spell, but constrains 
them to lie within a fixed interval of the true values. It can also fill or create 
short gaps in the employment spell, in a manner consistent with observed spells. 
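
The binary-string representation and the windowed re-imputation can be sketched as below. The sketch makes several simplifying assumptions: the history contains at least one employed quarter, the spell is taken to run from the first to the last employed quarter, and the logit coefficients in `example_logit_prob` are hypothetical placeholders rather than estimates. Only the quarters inside the windows are re-drawn, so the perturbed start and end dates stay close to the true ones.

```python
import math
import random

def impute_spell(history, window, logit_prob, rng):
    """Re-draw employment status in a window around the spell's start and end quarters."""
    start = history.index(1)                                  # first employed quarter
    end = len(history) - 1 - history[::-1].index(1)           # last employed quarter
    imputed = list(history)
    quarters = set()
    for edge in (start, end):
        quarters.update(range(max(0, edge - window), min(len(history), edge + window + 1)))
    for q in sorted(quarters):
        p = logit_prob(imputed, q)                            # Pr(employed in quarter q)
        imputed[q] = 1 if rng.random() < p else 0
    return imputed

def example_logit_prob(history, q, intercept=-2.0, slope=1.0):
    """Hypothetical binary logit using status in the four previous and subsequent quarters."""
    neighbours = [history[k] for k in range(q - 4, q + 5) if k != q and 0 <= k < len(history)]
    return 1.0 / (1.0 + math.exp(-(intercept + slope * sum(neighbours))))

true_history = [0] * 4 + [1] * 10 + [0] * 4                   # one digit per quarter
imputed_history = impute_spell(true_history, window=2,
                               logit_prob=example_logit_prob, rng=random.Random(5))
```

A full implementation would also condition on characteristics from all sampling frames and would produce one such history per implicate.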



5 Summary 

Multiply-imputed partially synthetic data hold great promise for statistical agen- 
cies and analysts alike. They satisfy the statistical agency’s need to protect the 
confidentiality of respondents’ data, while preserving the analyst’s ability to 
perform valid inference. In the context of longitudinal linked data, the synthetic 
data approach is particularly appealing. It is sufficiently flexible to maintain 
complex relationships between variables in various sample frames. As demonstrated 
in this paper, it is also possible to preserve the marginal distribution of 
confidential variables in the partially synthetic data. Furthermore, the synthetic 
data approach is adaptable to protecting confidential links between frames. The 
application of these methods to the LEHD database promises further refinement 
of the techniques discussed in this paper. This application will further demon- 
strate their ability to protect confidentiality and preserve valid inferences, and 
will demonstrate the practicality of the synthetic data approach. 

Abstract. Generation of a synthetic microdata set that reproduces the 
statistical properties of an original microdata set is a promising approach 
to statistical disclosure control (SDC) of microdata. In this paper, a new 
method for generating continuous synthetic microdata is proposed. The 
covariance matrix and the univariate statistics of the original data set 
are exactly preserved. The method is non-iterative and its complexity 
grows linearly with the number of records to be protected. 



1 Introduction 

Statistical databases can either contain tabular data or individual data (mi- 
crodata). Microdata can be continuous, e.g. age and weight, or categorical, for 
instance sex or hair color. When a microdata set is to be released for public 
use, confidentiality must be ensured. In that sense, the purpose of Statistical 
Disclosure Control (SDC) techniques is twofold: on one hand, SDC methods 
must prevent the identity of the individual respondent from being disclosed; on 
the other hand, the published set of data should preserve as many statistical 
properties as possible from the original set. 

One possibility for protecting a microdata set is to use a masking method 
(e.g. additive noise, microaggregation, etc., cf. [1]) to transform original data 
into protected, publishable data. An alternative to masking the original data is 
to generate a new data set (a synthetic data set) not from the original data, but 
from a set of random values that are adjusted in order to fulfill certain statistical 
requirements. A third possibility is to build a hybrid data set as a mixture of the 
masked original values and a synthetic data set [2]. 



1.1 Background on Synthetic Data Generation 

Publication of simulated - i.e. synthetic - data was proposed long ago as a way to 
guard against statistical disclosure. In fact, as early as 1993, Rubin [3] suggested 
creating an entirely synthetic data set based on the real survey data and multiple 
imputation. Specific case studies of synthetic microdata generated by multiple 
imputation were presented in [4, 5] . Although the results were fairly promising, 
the multiple imputation approach requires complex models and software, which 
greatly reduces its appeal in many situations. 

In [6, 7] comparisons were presented for measuring the performance of micro- 
data masking methods in terms of information loss and disclosure risk. Based 
on the proposed measures, it was shown in [8] how to improve the performance 
of any particular masking method. In particular, post-masking optimization was 
discussed for preserving as much as possible the moments of first and second 
order (and thus multivariate statistics) without increasing the disclosure risk. 
The technique proposed could also be used for synthetic microdata generation 
and could be extended for preservation of all moments up to m-th order, for 
any m. The shortcoming of this approach is its computational complexity: the 
optimization problem is solved using an iterative refinement approach, which 
may be quite time-consuming when the involved data sets are large. 

Latin Hypercube Sampling (LHS) appears in the literature as another method 
for generating multivariate synthetic data sets. In [9], the authors improve the updated 
LHS technique of [10], but the proposed scheme is still time-intensive even 
for a moderate number of records. In [11], LHS is used along with a rank correlation 
refinement to reproduce both the univariate (i.e. mean and variance) and 
multivariate structure (in the sense of rank correlation) of the original data set. 
This method also permits flexibility in the size of the synthetic data set that 
is generated. In summary, LHS-based methods rely on iterative refinement, are 
time-intensive and their running time does not only depend on the number of 
values to be reproduced, but on the starting values as well. 

1.2 Contribution and Plan of This Paper 

In this paper, a non-iterative method for generating continuous synthetic micro- 
data is proposed. The implementation of this method results in a fast algorithm 
which exactly reproduces the means and the covariance matrix of the original 
data set and whose running time grows linearly with the number of records. Exact 
preservation of the original covariance matrix implies that variances and Pear- 
son correlations are also exactly preserved in the synthetic data set. Like in any 
synthetic data generator, the number of records in the synthetic data set can 
differ from the number of records in the original data set. 

Section 2 describes our proposal for generating synthetic data. Section 3 
analyzes the complexity and the data utility properties of the proposed method. 
Empirical results are reported in Section 4. Finally, Section 5 contains some 
conclusions. 

2 A Low-Cost Method 

for Synthetic Microdata Generation 

Let X be an original microdata set, with n records and m variables. Let X' be
the synthetic microdata set to be generated, with n' records and m variables. In
fact, X can be viewed as an n × m matrix and X' can be viewed as an n' × m
matrix. The method presented in this section will ensure that both univariate
and multivariate statistical properties of X, such as mean and covariance, are
exactly reproduced in the resulting X'.

The algorithm below constructs X' from X.

Algorithm 1 (Basic Procedure) 



1. Generate A, a random n' × m matrix whose covariance matrix is the
identity matrix.

2. Compute the covariance matrix C of the original microdata matrix X.

3. Use the Cholesky decomposition on C to obtain

$C = U^t \times U$

where U is an upper triangular matrix and $U^t$ is the transpose of U.

4. Obtain the synthetic microdata set X' as a matrix product:

$X' = A \cdot U$

Note that the covariance matrix of X' equals the covariance matrix of X [12].

5. Due to the construction of matrix A, the mean of each variable in X' is
0. In order to preserve the means of the variables in X, a last adjustment is
performed. If $\bar{x}_j$ is the mean of the j-th variable in X, then $\bar{x}_j$ is added to
the j-th column (variable) of X':

$x'_{ij} := x'_{ij} + \bar{x}_j$ for $i = 1, \dots, n'$ and $j = 1, \dots, m$    (1)
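The five steps of Algorithm 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `synthesize` and `whitened_matrix` are hypothetical, covariances are taken with divisor n (population form), and the matrix A is produced here by a simple whitening stand-in rather than by the construction the paper gives in Algorithm 2.

```python
import numpy as np

def whitened_matrix(n_rec, m, rng):
    """Stand-in for Algorithm 2 (an assumption, not the paper's construction):
    random n_rec x m matrix with zero column means and identity covariance."""
    A0 = rng.normal(size=(n_rec, m))
    A0 -= A0.mean(axis=0)                 # zero column means
    S = A0.T @ A0 / n_rec                 # population covariance of A0
    L = np.linalg.cholesky(S)             # S = L L^T
    return np.linalg.solve(L, A0.T).T     # A = A0 L^{-T}, so Cov(A) = I

def synthesize(X, A):
    """Steps 2-5 of Algorithm 1 for a given A with identity covariance."""
    C = np.cov(X, rowvar=False, bias=True)   # step 2: covariance of X
    U = np.linalg.cholesky(C).T              # step 3: C = U^T U, U upper triangular
    return A @ U + X.mean(axis=0)            # steps 4 and 5: product, then add means

rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.7]])
X = rng.normal(size=(500, 3)) @ mix + 10.0         # toy "original" data, n = 500
X_syn = synthesize(X, whitened_matrix(800, 3, rng))  # n' = 800 synthetic records
```

In this sketch the synthetic set has a different number of records than the original, yet its column means and covariance matrix agree with those of X up to floating-point rounding.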

We now need to specify how to construct a random n' x m matrix A, whose 
covariance matrix is the m x m identity matrix. 

Algorithm 2 (Construction of Matrix A) 



1. Generate A as an n' × m matrix with random elements $a_{i,j}$. View the m
columns of A as samples of variables $A_1, \dots, A_m$. If $Cov(A_j, A_{j'})$ is the
covariance between variables $A_j$ and $A_{j'}$, the objective of the algorithm is
that

$Cov(A_j, A_{j'}) = \begin{cases} 1 & \text{if } j = j' \\ 0 & \text{otherwise} \end{cases}$

for $j, j' \in \{1, \dots, m\}$.

2. Let $\bar{a}_1$ be the mean of $A_1$. Let us adjust $A_1$ as follows:

$a_{i,1} := a_{i,1} - \bar{a}_1$ for $i = 1, \dots, n'$

The mean of the adjusted $A_1$ is 0.

3. In order to reach the desired identity covariance matrix, some values of
variables $A_2, \dots, A_m$ must change. For v = 2 to m do:

(a) Let $\bar{a}_v$ be the mean of variable $A_v$.

(b) For j = 1 to v − 1, the covariance between variables $A_j$ and $A_v$ is

$Cov(A_j, A_v) = \frac{\sum_{i=1}^{n'} (a_{i,j} - 0)(a_{i,v} - \bar{a}_v)}{n'} = \frac{\sum_{i=1}^{n'} a_{i,j} \cdot a_{i,v}}{n'}$

(c) In order to obtain $Cov(A_j, A_v) = 0$, j = 1, ..., v − 1, some elements
$a_{i,v}$ in the v-th column of A are assigned a new value. Let $x_1, \dots, x_{v-1}$
be the unknowns for the following linear system of v − 1 equations:

$\frac{1}{n'} \left( \sum_{i=1}^{n'-v+1} a_{i,j} \cdot a_{i,v} + \sum_{i=1}^{v-1} a_{n'-v+1+i,j} \cdot x_i \right) = 0$ for $j = 1, \dots, v-1$

that is,

$\sum_{i=1}^{n'-v+1} a_{i,j} \cdot a_{i,v} + \sum_{i=1}^{v-1} a_{n'-v+1+i,j} \cdot x_i = 0$ for $j = 1, \dots, v-1$

Once the aforementioned linear system is solved, the new values are assigned:

$a_{n'-v+1+i,v} := x_i$ for $i = 1, \dots, v-1$

(d) Let $\bar{a}_v$ be the mean of variable $A_v$. A final adjustment on $A_v$ is performed
to make its mean 0:

$a_{i,v} := a_{i,v} - \bar{a}_v$ for $i = 1, \dots, n'$

4. In the last step, values in A are adjusted in order to reach $Cov(A_j, A_j) = 1$
for j = 1, ..., m. If $\sigma_j$ is the standard deviation of variable $A_j$, the adjustment
is computed as:

$a_{i,j} := \frac{a_{i,j}}{\sigma_j}$ for $i = 1, \dots, n'$ and $j = 1, \dots, m$

With the construction proposed in this section, the number of records n'
in X' does not depend on the number of records n in X. Thus, disclosure of
n is prevented, which may be useful in some situations. On the other hand,
Algorithm 2 does not need to be run each time Algorithm 1 is run. In other words,
if $X_1, X_2, \dots, X_u$ are original microdata sets, each with $n_i$ records, i = 1, ..., u, and
m variables, then u synthetic microdata sets $X'_1, X'_2, \dots, X'_u$ can be generated,
each with n' records and m variables, with a single n' × m matrix A.
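The construction of A in Algorithm 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_A` is hypothetical, the initial values are drawn uniformly between 0 and 100, columns are indexed from 0 (so column index v has v earlier columns and v unknowns), and the small linear systems are assumed to be non-singular, which holds with probability one for random input.

```python
import numpy as np

def build_A(n_rec, m, rng):
    """Random n_rec x m matrix with zero column means and identity covariance,
    following the column-by-column scheme of Algorithm 2 (0-based indices)."""
    A = rng.uniform(0.0, 100.0, size=(n_rec, m))   # step 1: random start
    A[:, 0] -= A[:, 0].mean()                      # step 2: centre first column
    for v in range(1, m):                          # step 3: remaining columns
        head = n_rec - v                           # rows that keep their values
        prev = A[:, :v]                            # earlier, already centred columns
        # One equation per earlier column j:
        #   sum_{i<head} a_ij * a_iv + sum_i a_{head+i,j} * x_i = 0
        rhs = -prev[:head].T @ A[:head, v]
        x = np.linalg.solve(prev[head:].T, rhs)    # v unknowns
        A[head:, v] = x                            # overwrite the last v entries
        A[:, v] -= A[:, v].mean()                  # step 3(d): centre column v
    A /= A.std(axis=0)                             # step 4: unit variances
    return A

A = build_A(200, 4, np.random.default_rng(0))
```

Centring column v after the solve does not disturb the zero covariances, because every earlier column already has mean 0. The only data-dependent work per column is one small linear solve, so no iteration is involved.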



3 Properties of the Proposed Scheme 

3.1 Performance and Complexity 

To simplify the performance and complexity analysis presented here, we assume
that a synthetic data set of size n × m is generated from an original data set
of the same size, i.e. n' = n. The method has been tested with several data set
sizes and execution times are shown in Table 1.

Table 1. Running time (in seconds) on a 1.7 GHz desktop Intel PC under a Linux OS.
Note that time for random matrix generation is included

                     Number of variables
Number of records    5       10      25      50
1,000                0.00    0.00    0.05    0.31
10,000               0.05    0.19    1.26    5.31
100,000              0.49    1.93    12.41   51.15

The computational complexity for the proposed method will next be esti- 
mated. Let n be the number of records, m the number of variables and assume 
for simplicity n' = n. Then the complexities of the various operations are as 
follows: 

- Calculation of the covariance matrix: $O(n + m^2)$.

- Cholesky decomposition: $O(m^3/6)$ (see [13]).

- Calculation of A: $O(2nm + 2m^2 + 2m^4/3)$, where the term $2m^4/3$ is the cost
of solving a Gauss system m times [13].

- Matrix product: $O(nm^2)$.

- Mean adjustment: $O(nm)$.

In summary, the overall complexity is $O(nm + 2m^4/3) = O(n + m^4)$. To
understand this complexity, one should realize that, in general, the number of
records n is much larger than the number of variables m, i.e. $n \gg m$. The strong
point of this proposal is that its complexity is linear in the number of records. It
must also be kept in mind that, as pointed out at the end of Section 2, matrix
A can be re-used to generate several synthetic microdata sets, which greatly
reduces computation.



3.2 Data Utility 

As stated in Section 1.2, the proposed scheme exactly reproduces the statistical 
properties of the original data set. In particular: 

— The means of variables in the original data set X are exactly preserved in 
the synthetic data set X'. 

— The covariance matrix of X is exactly preserved in X' (see [12]). Thus, in 
particular: 

• The variance of each variable in X is preserved in X' . 

• The Pearson correlation coefficient matrix of X is also exactly preserved 
in X' , because correlations are obtained from the covariance matrix. 
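The preservation of the covariance matrix can be checked directly from Algorithms 1 and 2. The short derivation below is a sketch that assumes covariances are computed with divisor n' and uses the fact that the columns of A have mean 0, so that $Cov(A) = A^t A / n' = I$; the symbols $c_{jk}$ and $r_{jk}$ for the entries of the covariance and correlation matrices are introduced here only for illustration.

```latex
\begin{align*}
Cov(A \cdot U) &= \frac{(A U)^t (A U)}{n'}
                = U^t \left( \frac{A^t A}{n'} \right) U
                = U^t \, I \, U = C, \\
r_{jk} &= \frac{c_{jk}}{\sqrt{c_{jj}\, c_{kk}}} .
\end{align*}
```

Adding the constant $\bar{x}_j$ to column j in step 5 of Algorithm 1 shifts the means without changing any covariance, and since every Pearson correlation is a function of entries of C only, the correlation matrix is preserved as well.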
Table 2. Values of Score' for the synthetic data

Measure    Value
IL_1       544.53
IL_2       3.53565e-05
IL_3       3.25579e-04
IL_4       1.81034e-03
IL_5       1.90171e-04
IL         108.90
DLD        9.10
ID         33.21
Score'     65.0338



4 Empirical Work 

4.1 The Score 

In the experiments conducted to measure the disclosure risk and the information
loss in the synthetic data sets produced by our method, we use the Score' defined
in [2]. Score' is a modification of the original Score defined in [7] to deal with
synthetic data generation in which the number of records of the synthetic data
set differs from the number of records in the original data set. We briefly recall
the definition of Score':

Score' = 0.5 · IL + 0.25 · DLD + 0.25 · ID

where IL stands for information loss, DLD refers to distance-based record linkage
and ID stands for interval disclosure. DLD and ID are disclosure risk
measures. IL measures how different the synthetic data set is from the original
one. IL_1 is a component of IL which compares the individual original and
synthetic values, whereas the remaining components of IL reflect how different
the univariate and multivariate statistics are between X and X'. See [2] on how to
compute IL, DLD and ID.
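As a small worked example of the formula, the snippet below plugs in the rounded IL, DLD and ID values reported in Table 2. The function name `score_prime` is hypothetical and the inputs are the published two-decimal figures, so the result agrees with the tabulated Score' only up to rounding.

```python
def score_prime(il, dld, id_):
    """Score' = 0.5 * IL + 0.25 * DLD + 0.25 * ID (lower is better)."""
    return 0.5 * il + 0.25 * dld + 0.25 * id_

# Rounded values taken from Table 2 of the paper.
score = score_prime(il=108.90, dld=9.10, id_=33.21)
print(round(score, 4))  # close to the tabulated 65.0338
```

Half of the score comes from information loss and half from the two disclosure risk measures, so a large IL_1 (which dominates IL here) weighs heavily on the total.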

4.2 The Data Set 

The microdata set for testing was constructed using the Data Extraction System
of the U.S. Census Bureau (http://www.census.gov/DES) and contains n = 1080
records for m = 13 continuous variables. This data set was also used in [2, 6, 7].

4.3 The Results 

As mentioned in Step 1 of Algorithm 2, A is initially composed of random values. 
It must be noticed that whatever the magnitude of the values in X is, the range 
in which the initial random values for A are picked - say between 0 and 100 - 
does not affect the results. The score for a typical execution is shown in Table 2. 

Note that, since most of the values in A are random, the result for IL_1 shows
that most of the values in X are substantially different in X'. The point is that the
remaining components IL_2, IL_3, IL_4, IL_5 of the information loss show that the
statistical properties listed in Section 3.2 are exactly fulfilled (those measures
do not appear as exactly 0 due to rounding errors). On the other hand, the
disclosure risk measures DLD and ID are lower than those obtained for the
LHS-based method and reported in [2].

Due to the randomness of matrix A, different runs of the method will result in
different synthetic data sets and, consequently, the resulting Score' will change.
In 10 executions, an average value of 12.14 for DLD was obtained, with a standard
deviation of 0.97; the average obtained for ID was 37.99 with a standard
deviation of 5.11.

If synthetic data sets are generated whose number n' of records is not the
same as the number n of original records, the values for DLD and ID are
maintained (see Table 3). Hence, the disclosure risk measures do not depend on
the number of records of the synthetic data set.

Table 3. DLD and ID values for synthetic data sets with n' records

Number of records    DLD      ID
500                  13.40    33.27
2,000                12.65    45.65
8,000                15.14    34.14
10,000               12.54    34.77
20,000               11.86    39.60

4.4 Non-random Matrix A 

In order to reduce the information loss component IL_1 (individual record
comparison), one could think of choosing the initial values of matrix A in a “clever”
way rather than using initial random values. For example, A could be the result
of masking the original data set X using a perturbative masking method
(see [1]). This leads to a number of records in X' which equals the number
of records in X.

Table 4 shows the results obtained when different perturbative masking methods
are used. The lowest values for DLD and ID are reached when A has been
obtained using microaggregation [14] with parameter k = 20. The lowest value
for IL occurs for additive noise with parameter 2%.

5 Conclusions 

In this paper, a new method for generating a synthetic data set X' from a 
microdata set X has been presented. This method, specified in Algorithms 1 
and 2, is suitable for continuous microdata. 

Table 4. Results when using a masked data set as A

Masking method    Score'    IL'      DLD'     ID'
noise.02          36.42     36.41    20.85    52.00
noise.08          38.04     40.69    19.48    51.32
noise.14          35.61     36.50    18.20    51.24
microag.k=5       37.74     39.70    20.19    51.37
microag.k=10      40.96     45.21    23.28    50.15
microag.k=20      37.42     42.85    16.30    47.68
rankswap.5        38.76     43.89    16.77    50.50
rankswap.15       42.28     50.77    16.93    50.64

In addition to allowing a different number of records in the original and the
synthetic data sets (a property shared by all synthetic data generation methods),
the main properties of the method are:

— Its computational complexity is linear in the number of records. The imple- 
mentation of Algorithms 1 and 2 is simple and executions are efficient. One 
of the computations is solving m — 1 linear systems, where m is the num- 
ber of variables of X and X' (note that the number of variables is usually 
much smaller than the number of records). The remaining computations, i.e. 
Cholesky decomposition, matrix product, etc., are also of low complexity. 

— Algorithms 1 and 2 are non-iterative, so that the number of steps for obtain- 
ing X' can be known a priori. Other methods for obtaining synthetic data 
are based on iterative algorithms whose running time cannot be predicted 
before execution. 

— The following statistical properties of the original data set X are preserved 
by the synthetic data set X': mean, variances, the covariance matrix and the 
Pearson correlation matrix. Note that a good deal of methods for synthetic 
data generation in the literature do not preserve the variance-covariance 
matrix nor the Pearson correlation matrix. 

— Empirical work shows that the disclosure risk is lower than for masking 
methods. 

As shown in Section 4.3, IL is higher than for usual masking methods when
the initial values used for A are random. To keep IL_1 low while preserving
the covariance matrix and the univariate statistics, initial values for A can be
obtained using a perturbative masking method.
Abstract. This paper presents the results of an empirical test on numerical
microaggregation using μ-Argus 3.2. Several scenarios are defined
as combinations of different parameters: sample size, number of variables
used in the microaggregation and number of records per group. The
main aim of this work is finding a trade-off between disclosure risk and
information loss.



1 Introduction 

As the title of this paper shows, its objective is to publish the
results of a test on the microaggregation techniques implemented in μ-Argus 3.2
using real data and, more specifically, business data, a new aspect in the analysis.
In particular, the microaggregation parameters that, according
to the current literature, have provided the best results have been tested. Moreover,
it is important to remark that this work analyses the capabilities of μ-Argus when
a large number of records is used.

The analysis has been focused on the trade-off between the information loss
and the disclosure risk. In this sense, the methods used in order to calculate these
measures are based on Torres [6]. This empirical work, jointly with the works
developed by Domingo et al. ([2], [5]), has made it possible to implement the present
microaggregation algorithm that μ-Argus provides.



2 Data Description 

The analysis has been done using the Agriculture Census of Catalonia (1999).
The original file contains 77839 records, that is to say, a large file that has allowed
us to test the capabilities of μ-Argus and, moreover, to test the microaggregation
techniques.

A common SDC method is the release of only a sample of the records. So, two
independent samples with 5% and 10% of the records have been produced.
This way, three different analyses are produced from the point of view of the
sampling rate.

The names and number of records of the files with the original data are:

CA_TOTAL.asc       77839    census file
CA_SAMPLE10.asc     7784    10% sample
CA_SAMPLE5.asc      3892    5% sample

The file contains 2 qualitative variables and 7 quantitative variables that 
have been chosen for the analysis: 

- PROV: province (4 values) 

- OTE: main product (66 values) 

- SUP: total agricultural area 

- SAU: utilised agricultural area 

- SREG: irrigated agricultural area 

- UTA: annual work units 

- UTAA: non-family annual work units

- UR: livestock units 

- MET: total gross margin 

Three data files have been mentioned: one census file and two sample files.
The census file contains all the records. Two criteria have been applied in the
construction of the sample files:

1. All the records with three or more zeros in their numerical variables have
been deleted from the original file, because they could bias the results for
files with a relatively small number of records, as the sample files are.

2. After that, a random number has been added to each record and the file
has been sorted by this new variable. In order to generate the sample files,
the first records (5% of the original total) and the last records (10% of the
original total) have been selected. This way, no record appears in both
sample files, so that the comparison of the results is not affected
by a group of common records.
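The two criteria can be sketched as follows. This is an illustrative sketch only: the function name `disjoint_samples` and its parameters are hypothetical, the data are assumed to be held as a NumPy array of the numerical variables, and the random sort key stands in for the "random number added to each record".

```python
import numpy as np

def disjoint_samples(values, rng, frac_small=0.05, frac_large=0.10, max_zeros=2):
    """Return row indices of two non-overlapping samples.

    values: (n, p) array with the numerical variables of every record.
    Records with more than `max_zeros` zero values are discarded first.
    Sample sizes are fractions of the ORIGINAL number of records.
    """
    n_total = values.shape[0]
    keep = np.flatnonzero((values == 0).sum(axis=1) <= max_zeros)  # criterion 1
    order = keep[np.argsort(rng.random(keep.size))]                # criterion 2: random sort
    n_small = int(round(frac_small * n_total))
    n_large = int(round(frac_large * n_total))
    if n_small + n_large > order.size:
        raise ValueError("not enough records left for two disjoint samples")
    return order[:n_small], order[-n_large:]       # first records, last records

rng = np.random.default_rng(0)
values = rng.random((1000, 7))
values[:100, :3] = 0.0            # 100 toy records with three zeros each
small, large = disjoint_samples(values, rng)
```

Taking one sample from the head and the other from the tail of the same random ordering is what guarantees that the two samples share no record. With the census size of 77839, the same rounding gives 3892 and 7784 records, matching the file sizes listed above.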

3 Parameters of Analysis 

At this stage, three files are available for the analysis, but two more parameters 
have been introduced in order to do a more exhaustive testing. 

1. Number of variables applied in the microaggregation: first of all, it has been
decided how many and which variables had to be microaggregated. In this
sense, 4 different combinations have been applied:

- all of the 7 numerical variables

- a combination of 3 + 4 variables (microaggregation in 2 steps)

- a combination of 3 + 2 + 2 variables (microaggregation in 3 steps)

- a combination of 4 + 3 variables (a different microaggregation in 2 steps)

The decision about which variables are chosen is based on the previous
knowledge of experts at Idescat about the characteristics of each variable.
So, the 3 combinations of grouped variables are:

- 3 + 4 variables combination:

- first microaggregation step: SUP, SAU, SREG

- second microaggregation step: UTA, UTAA, UR, MET

- 3 + 2 + 2 variables combination:

- first microaggregation step: SUP, SAU, SREG

- second microaggregation step: UTA, UTAA

- third microaggregation step: UR, MET

- 4 + 3 variables combination:

- first microaggregation step: SUP, SREG, UTA, UTAA

- second microaggregation step: SAU, UR, MET

2. Number of records per group (k): four different options have been applied:

- k=3

- k=5

- k=10

- k=15

Summing up, the total number of analyses performed is 48 (3 files × 4 combinations
of variables × 4 different k values).
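The 48 scenarios are simply the Cartesian product of the three parameters. The sketch below enumerates them and builds the output file names following the naming pattern used in Table 1 of this paper (for example MAT_7_3); the dictionary and function names are hypothetical and serve only as an illustration.

```python
from itertools import product

# Output-name prefix for each input file, as used in Table 1.
PREFIX = {"CA_TOTAL": "MAT", "CA_SAMPLE10": "MA10", "CA_SAMPLE5": "MA5"}
# Variable combinations and the code each one gets in the file name.
COMBOS = {"7": "7", "3+4": "34", "3+2+2": "322", "4+3": "43"}
K_VALUES = (3, 5, 10, 15)

def scenario_names():
    """One name per (file, variable combination, k) scenario."""
    return [f"{PREFIX[f]}_{COMBOS[c]}_{k}"
            for f, c, k in product(PREFIX, COMBOS, K_VALUES)]

names = scenario_names()
print(len(names), names[0], names[-1])
```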

We want to remark that the analysis has been based on previous works
([2], [4], [5], [6]), and so the parameter values have been selected according to
the conclusions of these works.

4 Analysis Phases 

The analysis carried out to achieve the aims of this work is divided into two
stages:

1. Creation of the microaggregated file (μ-Argus)

2. Measure of the information loss and disclosure risk (SAS)

4.1 Microaggregation Files 

Microaggregation is a family of statistical disclosure techniques for quantitative
microdata, which belong to the substitution/perturbation category [1]. To obtain
microaggregates in a microdata set with n records, these are combined to form
g groups of size at least k. For each variable, the average value over each group
is computed and is used to replace each of the original values in the group. For a
microdata set consisting of many variables, these can be microaggregated together or
partitioned into several groups of variables. μ-Argus [3] implements a multivariate
fixed-size microaggregation method that tries to form homogeneous groups of
records by taking into account the distances between records themselves and
between records and the average of all records in the data set.
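The basic idea of replacing values by group averages can be sketched as below. This is deliberately NOT the μ-Argus algorithm: as a simplifying assumption, records are grouped by sorting on the sum of their standardised values, whereas μ-Argus forms groups from distances between records. The function name `microaggregate` is hypothetical, and the sketch assumes n ≥ k and no constant variable.

```python
import numpy as np

def microaggregate(X, k):
    """Replace each record by the mean of its group (groups of size >= k).

    Grouping rule (an assumption for illustration): sort records by the sum
    of their standardised values and cut the ordering into blocks of k;
    the last block absorbs the remainder.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # put variables on a common scale
    order = np.argsort(Z.sum(axis=1))
    n_groups = n // k
    out = np.empty_like(X)
    for g in range(n_groups):
        stop = (g + 1) * k if g < n_groups - 1 else n
        idx = order[g * k:stop]
        out[idx] = X[idx].mean(axis=0)         # group average replaces the originals
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(23, 3))
X_masked = microaggregate(X, k=5)
```

Because every value is replaced by the mean of its own group, the overall mean of each variable is unchanged, while every masked record is shared by at least k original records.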

In order to make the analysis of the results easier, the names of the 48 files
created by μ-Argus are shown in the next table:

Table 1.

               Number of           Number of variables
File           records per group   7            3 + 4         3 + 2 + 2      4 + 3
CA_TOTAL       k=3                 MAT_7_3      MAT_34_3      MAT_322_3      MAT_43_3
               k=5                 MAT_7_5      MAT_34_5      MAT_322_5      MAT_43_5
               k=10                MAT_7_10     MAT_34_10     MAT_322_10     MAT_43_10
               k=15                MAT_7_15     MAT_34_15     MAT_322_15     MAT_43_15
CA_SAMPLE10    k=3                 MA10_7_3     MA10_34_3     MA10_322_3     MA10_43_3
               k=5                 MA10_7_5     MA10_34_5     MA10_322_5     MA10_43_5
               k=10                MA10_7_10    MA10_34_10    MA10_322_10    MA10_43_10
               k=15                MA10_7_15    MA10_34_15    MA10_322_15    MA10_43_15
CA_SAMPLE5     k=3                 MA5_7_3      MA5_34_3      MA5_322_3      MA5_43_3
               k=5                 MA5_7_5      MA5_34_5      MA5_322_5      MA5_43_5
               k=10                MA5_7_10     MA5_34_10     MA5_322_10     MA5_43_10
               k=15                MA5_7_15     MA5_34_15     MA5_322_15     MA5_43_15

4.2 Measure of the Information Loss and Disclosure Risk 

The quality of a microaggregation method can be assessed from the information loss
due to the publication of non-original data and from the disclosure risk. Some measures
are defined according to Torres [6].

Let us suppose a set of microdata corresponding to n individuals $I_1, I_2, \dots, I_n$
and p continuous variables $Z_1, Z_2, \dots, Z_p$.

Let X be a matrix of n rows and p columns representing the original microdata
set and X' a matrix of n rows and p columns representing the modified
microdata set.

$\bar{X}$, $\bar{X}'$ are the p-dimensional vectors of means of X and X'. V, V' are
the p × p covariance matrices corresponding to X and X'. R, R' are the p × p
correlation matrices corresponding to X and X'.



Information Loss Measures (PI): Information loss can be measured as a
function of structural differences between X and X'. Five different measures are
defined:

PI1: mean variation of data, X − X'

$PI1 = 100 \cdot \frac{\sum_{j=1}^{p} \sum_{i=1}^{n} \frac{|x_{ij} - x'_{ij}|}{|x_{ij}|}}{np}$

Note: if $x_{ij} = 0$ and $x'_{ij} \neq 0$, then divide by $|x'_{ij}|$. If $x_{ij} = x'_{ij} = 0$, the term
is not added to the sum. This rule is applied to the rest of the formulas.

PI2: mean variation of data means, $\bar{X} - \bar{X}'$

$PI2 = 100 \cdot \frac{\sum_{j=1}^{p} \frac{|\bar{x}_j - \bar{x}'_j|}{|\bar{x}_j|}}{p}$

PI3: mean variation of data covariances, V − V'

$PI3 = 100 \cdot \frac{\sum_{j=1}^{p} \sum_{1 \le i \le j} \frac{|v_{ij} - v'_{ij}|}{|v_{ij}|}}{\frac{p(p+1)}{2}}$

PI4: mean variation of data variances, S − S' (the diagonal elements of V and V')

$PI4 = 100 \cdot \frac{\sum_{j=1}^{p} \frac{|v_{jj} - v'_{jj}|}{|v_{jj}|}}{p}$

PI5: mean absolute error of data correlations, R − R'

$PI5 = 100 \cdot \frac{\sum_{j=1}^{p} \sum_{1 \le i < j} |r_{ij} - r'_{ij}|}{\frac{p(p-1)}{2}}$

Global information loss PI is defined and computed as a weighted mean of
PI1, PI2, PI3, PI4 and PI5:

PI = ip/1 + (^P/2 + ^P/4) + (ip/3 + ip/5) 
Job D D 
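The five component measures can be sketched in NumPy as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, covariances use NumPy's default divisor, and the zero-denominator rule from the note on PI1 is applied to every relative deviation. The weights of the global PI are not reproduced here.

```python
import numpy as np

def rel_dev(a, b):
    """|a - b| / |a|; divide by |b| where a == 0; contribute 0 where both are 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.where(a != 0, np.abs(a), np.abs(b))
    out = np.zeros_like(a)
    np.divide(np.abs(a - b), denom, out=out, where=denom != 0)
    return out

def information_loss(X, Xp):
    """Return (PI1, PI2, PI3, PI4, PI5) for original X and modified Xp."""
    n, p = X.shape
    V, Vp = np.cov(X, rowvar=False), np.cov(Xp, rowvar=False)
    R, Rp = np.corrcoef(X, rowvar=False), np.corrcoef(Xp, rowvar=False)
    upper = np.triu_indices(p)          # pairs with i <= j
    strict = np.triu_indices(p, k=1)    # pairs with i < j
    pi1 = 100 * rel_dev(X, Xp).sum() / (n * p)
    pi2 = 100 * rel_dev(X.mean(axis=0), Xp.mean(axis=0)).sum() / p
    pi3 = 100 * rel_dev(V[upper], Vp[upper]).sum() / (p * (p + 1) / 2)
    pi4 = 100 * rel_dev(np.diag(V), np.diag(Vp)).sum() / p
    pi5 = 100 * np.abs(R - Rp)[strict].sum() / (p * (p - 1) / 2)
    return pi1, pi2, pi3, pi4, pi5

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(50, 4))
same = information_loss(X, X)           # no modification
doubled = information_loss(X, 2 * X)    # every value doubled
```

Doubling every value is a convenient sanity check: values and means deviate by 100%, covariances and variances by 300% (they scale by 4), and correlations do not change at all.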



Disclosure Risk Measures: The analysis of a disclosure control method cannot
be reduced to information loss: the loss of confidentiality due to the dissemination
of modified microdata must also be analysed. Three different measures
are used in order to obtain a global disclosure risk measure: ERD, ICN and ICD.
The global disclosure risk measure is defined as

$PC = \frac{1}{2} ERD + \left( \frac{1}{4} ICN + \frac{1}{4} ICD \right)$

ERD: Record Matching Based on Distances:

This measure is based on the idea of matching records. This method supposes
that an intruder has a set of external microdata Y containing a subset of key
variables that are common to the modified microdata X'. The intruder tries to
match the modified microdata X' and the external microdata Y using the subset
of common variables in order to discover original data X. For every record $x'_i$ in
the modified file, distances to all the records $\{x_k\}_{k=1,\dots,n}$ in the original file are
calculated using only the subset of common variables (variables are normalised
prior to the calculation of distances). If the modified record $x'_i$ and its closest
original record $x_j$ are the same record (i = j), a match is produced. ERD is
defined as the percentage of matched records.

ICN: Confidentiality Interval on Number of Records:

This measure is based on the idea of building intervals on the modified microdata.

Each variable $X'_j$ is sorted independently of the others. For each value $x'_{ij}$ and
variable $X'_j$ a centred interval $I_{ij}$ is built containing at most q% of the number
of records (q is fixed). A record $x_i$ is matched if for all its variables j = 1, ..., p,
$x_{ij} \in I_{ij}$. ICN is defined as the percentage of matched records.

ICD: Confidentiality Interval on Standard Deviation:

This measure is similar to ICN, building centred intervals $I_{ij}$ of range at most q%
of the standard deviation of each variable $X'_j$ (q is fixed). A record $x_i$ is matched
if for all its variables j = 1, ..., p, $x_{ij} \in I_{ij}$. ICD is defined as the percentage of
matched records.
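Two of the three risk measures can be sketched compactly. The code below is an illustration under explicit assumptions, not the SAS implementation used in the paper: the function names `erd` and `icd` are hypothetical; variables are normalised with the mean and standard deviation of the original file; and the ICD interval is taken as centred on the modified value with total width q% of the standard deviation of the modified variable. ICN would follow the same pattern with rank-based intervals.

```python
import numpy as np

def erd(X, Xp, cols):
    """Percentage of modified records whose nearest original record
    (on the normalised common variables `cols`) is their own original."""
    mu, sd = X[:, cols].mean(axis=0), X[:, cols].std(axis=0)
    Z = (X[:, cols] - mu) / sd
    Zp = (Xp[:, cols] - mu) / sd
    d2 = ((Zp[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)  # all pairwise distances
    nearest = d2.argmin(axis=1)
    return 100.0 * np.mean(nearest == np.arange(X.shape[0]))

def icd(X, Xp, q):
    """Percentage of records whose original values all fall inside an interval
    centred on the modified value, of total width q% of the std deviation."""
    half_width = (q / 100.0) * Xp.std(axis=0) / 2.0
    inside = np.abs(X - Xp) <= half_width
    return 100.0 * np.mean(inside.all(axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
risk_unmasked = erd(X, X, cols=[0, 1]), icd(X, X, q=5)
risk_shifted = icd(X, X + 10 * X.std(axis=0), q=5)
```

Publishing the data unmodified gives the maximum value of 100 for both measures, which is the "total revelation risk" case discussed for the global measure; a strong perturbation drives ICD to 0.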

Quality Global Measure: A global measure must give the same importance 
to information loss and disclosure risk. So, a global measure is defined as 



MG = 0.5 • PI + 0.5 • PC = 0.5 • PI + 0.25 • ERD + 0.125 • ICN + 0.125 • ICD 

Every term of the sum MG belongs to the interval [0, 100], except PI that 
could be higher than 100. The next rule could be useful for the understanding of 
MG values: publishing the original microdata using no disclosure control method, 
would produce no information loss (PI=0) but a total revelation risk (PC=100). 
In that case, MG=50. As a consequence, any method with a value of MG greater 
than 50 would be useless; and the lower the value of MG, the better. 

An alternative global measure can be used, as some authors suggest [7], in
order to test the validity of the results. As disclosure risk is an important topic
for statistical offices, a measure giving more weight to it is defined:

$MG2 = \frac{1}{3} \cdot PI + \frac{2}{3} \cdot PC$
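The global measures can be checked numerically against one row of Table 2. The sketch below uses the MAT_7_3 row; the function names are hypothetical, and the MG2 weights of 1/3 and 2/3 are an assumption that is adopted here because they reproduce the tabulated mg2 values.

```python
def pc(erd, icn, icd):
    """Global disclosure risk: PC = 0.5*ERD + 0.25*ICN + 0.25*ICD."""
    return 0.5 * erd + 0.25 * icn + 0.25 * icd

def mg(pi, pc_value):
    """Equal weight for information loss and disclosure risk."""
    return 0.5 * pi + 0.5 * pc_value

def mg2(pi, pc_value):
    """More weight on disclosure risk (weights 1/3 and 2/3 assumed)."""
    return pi / 3.0 + 2.0 * pc_value / 3.0

# Row MAT_7_3 of Table 2: pi = 11.27, erd = 12.35, icn = 27.94, icd = 63.72.
pc_value = pc(12.35, 27.94, 63.72)
mg_value = mg(11.27, pc_value)
mg2_value = mg2(11.27, pc_value)
print(round(pc_value, 2), round(mg_value, 2), round(mg2_value, 2))
```

The three results agree with the pc, mg and mg2 entries reported for that file (29.09, 20.18 and 23.15) to two decimals.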

Table 2 shows all the measures that have been calculated:

Table 2.

File          sample%  variables  k   mg     mg2    pi      pc     erd    icn    icd
MAT_7_3       100      7          3   20.18  23.15  11.27   29.09  12.35  27.94  63.72
MAT_7_5       100      7          5   18.88  19.83  16.00   21.75  6.91   15.62  57.57
MAT_7_10      100      7          10  20.37  19.00  24.49   16.25  3.44   7.71   50.39
MAT_7_15      100      7          15  22.33  19.58  30.59   14.08  2.33   4.97   46.69
MAT_34_3      100      3 + 4      3   35.19  45.74  3.54    66.84  53.81  76.56  83.18
MAT_34_5      100      3 + 4      5   31.92  40.79  5.30    58.54  47.56  60.70  78.33
MAT_34_10     100      3 + 4      10  28.45  34.97  8.87    48.02  39.51  40.67  72.40
MAT_34_15     100      3 + 4      15  27.06  32.02  12.20   41.93  34.95  28.88  68.94
MAT_322_3     100      3 + 2 + 2  3   40.17  53.00  1.68    78.66  63.60  92.80  94.64
MAT_322_5     100      3 + 2 + 2  5   38.79  50.53  3.60    73.99  60.22  84.65  90.87
MAT_322_10    100      3 + 2 + 2  10  37.26  48.10  4.74    69.78  56.39  77.87  88.49
MAT_322_15    100      3 + 2 + 2  15  36.08  46.07  6.10    66.06  53.76  70.90  85.82
MAT_43_3      100      4 + 3      3   34.94  45.13  4.37    65.51  58.25  59.89  85.64
MAT_43_5      100      4 + 3      5   31.27  39.42  6.82    55.72  46.82  49.30  79.95
MAT_43_10     100      4 + 3      10  36.85  47.70  4.31    69.39  61.22  68.94  86.18
MAT_43_15     100      4 + 3      15  26.18  29.88  15.09   37.28  32.15  17.26  67.55
MA10_7_3      10       7          3   25.29  22.99  32.19   18.39  13.85  8.31   37.55
MA10_7_5      10       7          5   32.02  25.35  52.06   11.99  7.82   2.90   29.39
MA10_7_10     10       7          10  48.71  34.92  90.07   7.35   3.98   0.80   20.64
MA10_7_15     10       7          15  54.13  37.91  102.81  5.46   2.70   0.27   16.15
MA10_34_3     10       3 + 4      3   31.41  38.93  8.86    53.96  56.92  39.95  62.06
MA10_34_5     10       3 + 4      5   29.23  33.80  15.49   42.96  48.63  20.94  53.66
MA10_34_10    10       3 + 4      10  29.11  29.98  26.51   31.72  37.57  6.89   44.86
MA10_34_15    10       3 + 4      15  28.72  27.94  31.05   26.39  31.57  2.54   39.86
MA10_322_3    10       3 + 2 + 2  3   38.94  50.00  5.73    72.14  67.39  74.38  79.41
MA10_322_5    10       3 + 2 + 2  5   37.11  46.25  9.70    64.52  63.32  59.17  72.25
MA10_322_10   10       3 + 2 + 2  10  35.91  41.77  18.32   53.50  57.14  37.63  62.09
MA10_322_15   10       3 + 2 + 2  15  34.51  38.75  21.80   47.23  53.04  25.36  57.46
MA10_43_3     10       4 + 3      3   32.78  37.29  19.24   46.31  51.81  24.00  57.61
MA10_43_5     10       4 + 3      5   32.91  34.16  29.17   36.65  43.32  11.51  48.45
MA10_43_10    10       4 + 3      10  55.81  46.06  85.08   26.55  32.74  2.94   37.76
MA10_43_15    10       4 + 3      15  56.27  44.67  91.08   21.46  26.66  0.95   31.58
MA5_7_3       5        7          3   25.70  22.77  34.46   16.93  14.38  5.42   33.56
MA5_7_5       5        7          5   30.55  23.91  50.48   10.62  8.32   1.49   24.33
MA5_7_10      5        7          10  41.12  29.48  76.03   6.21   4.19   0.15   16.29
MA5_7_15      5        7          15  47.50  33.16  90.51   4.48   2.88   0.13   12.05
MA5_34_3      5        3 + 4      3   32.44  38.20  15.15   49.73  55.81  30.68  56.63
MA5_34_5      5        3 + 4      5   34.42  35.84  30.13   38.70  46.39  14.54  47.48
MA5_34_10     5        3 + 4      10  38.66  35.07  49.42   27.90  35.03  3.24   38.28
MA5_34_15     5        3 + 4      15  40.16  34.36  57.58   22.75  28.55  0.77   33.12
MA5_322_3     5        3 + 2 + 2  3   39.62  49.06  11.29   67.94  66.90  64.16  73.79
MA5_322_5     5        3 + 2 + 2  5   36.42  44.10  13.37   59.46  62.29  48.07  65.21
MA5_322_10    5        3 + 2 + 2  10  37.49  40.96  27.09   47.90  55.08  25.77  55.65
MA5_322_15    5        3 + 2 + 2  15  38.80  39.60  36.40   41.20  50.55  14.34  49.36
MA5_43_3      5        4 + 3      3   29.40  33.88  15.97   42.84  50.99  17.16  52.21
MA5_43_5      5        4 + 3      5   29.05  30.43  24.92   33.18  42.01  6.32   42.37
MA5_43_10     5        4 + 3      10  32.51  29.46  41.66   23.36  30.78  1.31   30.58
MA5_43_15     5        4 + 3      15  34.73  29.12  51.56   17.90  23.73  0.31   23.84

Notes on the Implementation of Measures:

For implementing ERD (Record matching based on distances), 7 different scenarios
have been defined, depending on the variables known by the intruder:

ERD-1: One common variable: SUP
ERD-2: Two common variables: SUP, SAU
ERD-3: Three common variables: SUP, SAU, UTA
ERD-4: Four common variables: SUP, SAU, UTA, UR
ERD-5: Five common variables: SUP, SAU, UTA, UR, MET
ERD-6: Six common variables: SUP, SAU, UTA, UR, MET, SREG
ERD-7: Seven common variables: SUP, SAU, UTA, UR, MET, SREG, UTAA
ERD is defined as the (equally weighted) average of these 7 scenarios:

ERD = (ERD-1 + ERD-2 + ERD-3 + ERD-4 + ERD-5 + ERD-6 + ERD-7)/7

For implementing ICN and ICD a value of q = 5% has been used. In [6], values
of q = 1%, 2%, ..., 10% have been used, producing ICN-1, ICN-2, ..., ICN-10 and
defining ICN = (ICN-1 + ICN-2 + ... + ICN-10)/10 (analogous for ICD).
According to the results by Torres [6] we have observed that ICN-5 ≈ ICN and
ICD-5 ≈ ICD. So, we have defined ICN = ICN-5 and ICD = ICD-5.

Some graphics based on these results are shown (see appendix). They are
focussed on two kinds of comparisons:

Figures A: comparison between the numbers of microaggregated variables 7, 3 + 4,
3 + 2 + 2 and 4 + 3.

Figures B: variation of the MG and MG2 measures for different k
and multi-step aggregation scenarios.

5 Results 

1. PI measures information loss. PI increases when k increases. For a fixed
k and file, microaggregation 3+2+2 is the best and microaggregation 7 is
clearly the worst. For a fixed k and aggregation of variables, the census file
always has lower values of PI. The worst results of PI are obtained for the
5% and 10% files with k=10 and k=15 and aggregation of 7 variables.

2. PC measures disclosure risk. PC decreases when k increases. For a fixed k
and file, aggregation of 7 variables is clearly the best method and 3+2+2 the
worst. For a fixed k and aggregation of variables, the 5% file is always the best
and the census file the worst. The best results of PC are obtained for aggregation
of 7 variables and the worst for 3+2+2. These are in fact the expected results:
increasing values of k and an increasing number of aggregated variables produce
worse results of PI and better results of PC.

3. MG is a global measure giving the same importance to information loss and
confidentiality loss. The best results are obtained for the census file with
aggregation of 7 variables and the worst results are for the 10% sample with
aggregation of 4+3 variables and k=15. Aggregations 3+4 and 3+2+2 produced stable
results of MG, always between 30% and 40%. On the other hand, the results
corresponding to aggregations 7 and 4+3 are much more sensitive to k for the
5% and 10% files, because of the values of PI. This can be represented on a
three-dimensional graphic (Figures B): the best results of MG are obtained for
the vertex (variables=7; k=3); moving in any direction makes MG increase.
Except for the census file, the best results are always obtained for a compromise
between k and ‘variables’.

4. MG2 is a global measure giving more importance to confidentiality loss than
to information loss. The best results are obtained for the census file with
aggregation of 7 variables and the worst results are for the census file with
aggregation of 3+2+2 variables. The best results of MG2 are obtained for the
vertex of the three-dimensional graphic (variables=7; k=3); moving in any
direction makes MG2 increase, but the increase is lower along the axis variables=7.
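
The relation between the global measures and their components can be sketched as follows. The formulas are an assumption: this section does not restate them, but MG = (PI + PC)/2 and MG2 = (PI + 2 PC)/3 reproduce the tabulated values if the third and fourth numeric columns of the results table are read as PI and PC. The function names are illustrative only.

```python
def mg(pi: float, pc: float) -> float:
    """Global measure weighting information loss (PI) and disclosure risk (PC)
    equally. Assumed form, inferred from the tabulated values."""
    return (pi + pc) / 2


def mg2(pi: float, pc: float) -> float:
    """Global measure giving PC twice the weight of PI. Assumed form."""
    return (pi + 2 * pc) / 3


# Values read from the row MA5_43_15 of the results table
# (decimal commas converted to decimal points).
pi, pc = 51.56, 17.90
print(round(mg(pi, pc), 2), round(mg2(pi, pc), 2))
```

Under these assumptions the row MA5_322_5 (PI = 13,37; PC = 59,46) is reproduced in the same way.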







6 Conclusions 

Multivariate microaggregation has been tested on a business data file, varying
3 different parameters: the sampling rate, the way in which variables are
aggregated and the number of records per group. Measures of disclosure risk and
information loss have been calculated and two global measures have been defined
as a trade-off between them.

The results show that microaggregation in several steps (vars=3+2+2) always
produces worse global measures. The best global measures are obtained for
microaggregation in one step, though not for every number of records per group.
These results confirm Torres’ study [6].

The effect of group size on the global measures depends on the way variables
are aggregated. In any case, the best global measures are obtained by aggregating
all the variables in one step with group size 3 or 5. As a consequence, the default
group size k=10 in μ-Argus could be improved.

It is important to note that different conclusions about the choice of group
size and the way in which variables are aggregated can be obtained depending
on the size of the file. This may be related to the number of records in the file
or to the sampling rates. Further analysis with other files should be done.

A final remark on μ-Argus should be made. The ‘numerical microaggregation’
window should provide more information about the kind of microaggregation the
user is choosing, univariate or multivariate, especially with regard to the
‘optimal method’ option.



Acknowledgements 

This work was partially supported by the European Union project IST-2000-25069
Computational Aspects of Statistical Confidentiality (CASC).

References 

1. Defays, D. and Anwar, N. (1995), Micro-aggregation: a generic method, in Proc.
of the 2nd International Symposium on Statistical Confidentiality, Luxembourg:
Office for Official Publications of the European Communities, 69-78.

2. Domingo-Ferrer, J. and Mateo-Sanz, J.M. (2002), Practical Data-Oriented
Microaggregation for Statistical Disclosure Control, IEEE Transactions on
Knowledge and Data Engineering, Vol. 14, 189-201.

3. Hundepool, A. et al. (2002), μ-Argus 3.2 User’s Manual. Voorburg: Statistics
Netherlands.

4. Oganian, A. (2003), Security and Information Loss in Statistical Database
Protection. Doctoral Thesis. Departament de Matematica Aplicada 4, Universitat
Politecnica de Catalunya.

5. Sebe, F., Domingo-Ferrer, J., Mateo-Sanz, J.M. and Torra, V. (2002),
“Post-masking optimization of the trade-off between information loss and
disclosure risk in masked microdata sets”, J. Domingo-Ferrer (Ed.): Inference
Control in Statistical Databases, Springer LNCS 2316, pp. 163-171.







6. Torres, A. (2003), Contribucions a la Microagregacio per a la Proteccio de Dades
Estadistiques (Contributions to the Microaggregation for the Statistical Data
Protection). Doctoral Thesis. Departament d’Estadistica i Investigacio Operativa,
Universitat Politecnica de Catalunya.

7. Yancey, W.E., Winkler, W.E. and Creecy, R.H. (2002), “Disclosure Risk
Assessment in Perturbative Microdata Protection”, J. Domingo-Ferrer (Ed.): Inference
Control in Statistical Databases, Springer LNCS 2316, pp. 135-152.



Appendix 




Fig. A.1. PI in the microaggregation over the census file

Fig. A.2. PI in the microaggregation over a sample of 10%

Fig. A.4. PC in the microaggregation over the census file

Fig. A.5. PC in the microaggregation over a sample of 10%

Fig. A.6. PC in the microaggregation over a sample of 5%

Fig. A.7. MG in the microaggregation over the census file

Fig. A.8. MG in the microaggregation over a sample of 10%

Fig. A.11. MG2 in the microaggregation over a sample of 10%

Fig. B.1. MG crossed by k and variables in the census file

Fig. B.2. MG crossed by k and vars in the sample 10% file

Fig. B.3. MG crossed by k and vars in the sample 5% file

Fig. B.4. MG2 crossed by k and variables in the census file

Fig. B.5. MG2 crossed by k and vars in the sample 10% file

Fig. B.6. MG2 crossed by k and vars in the sample 5% file

Abstract. The CASC project has brought together a major part of the SDC
research community in Europe. This joint effort has resulted in many steps
forward in the research on SDC. Very visible results of this project are the twin
software packages μ-ARGUS and τ-ARGUS. They make many research results
readily available. This paper summarises the CASC project and gives an
overview of the current ARGUS packages.



1 Introduction 

There is a growing interest in confidentiality issues in official statistics. On the
one hand, decision-makers demand more and more detailed statistical information,
and researchers, who can perform complex statistical analyses on their powerful PCs,
require detailed microdata. There is therefore growing pressure on the statistical
offices to publish more and more detailed information. On the other hand,
statistical institutes have to preserve the balance between their task as a data
provider and their obligation to protect the privacy of the respondents, who have
entrusted their individual information to them. Respondents are increasingly aware
of their privacy, and without respondents there is no statistical information.

The CASC project is an initiative to coordinate research and development on
Statistical Disclosure Control (SDC) in Europe. The project is partly subsidised by
the 5th Framework Programme of the EU. It aims to combine research with the
development of practical tools, the ARGUS software. CASC concentrates on solving
the SDC problems for both microdata and tabular data.

The national statistical institutes (NSIs) are traditionally very well equipped to
carry out large censuses and large-scale surveys. These sources of information
contain very detailed, rich information about enterprises and individuals. The lack
of power of computer systems is no longer a barrier to the composition of very large
and detailed tables, the traditional output of the NSIs.

New information systems, online databases and internet-based access systems
make publishing these large tables a possibility, where previously the physical limits
of paper publications restricted the amount of detail that could be published.
For the users of statistical information (policy makers, researchers etc.) this is a
very positive development. The NSIs will meet these requests for information using
the new technology.

However, there is another side to the coin. When the NSIs collect the information
needed to compose these large statistical databases, they have, for obvious
reasons, guaranteed the confidentiality of the information provided by the
respondents. Whether the information is collected via a voluntary survey or through
a compulsory survey or census, it is vital for the NSIs to safeguard the respondents’
confidentiality. As well as being a legal obligation, this is also vital in maintaining
the confidence of respondents. If the respondents have the feeling that their
sensitive information is no longer safe in the hands of the NSIs, then response rates
will fall and the value of the outputs will drop.

The 5th Framework CASC project has made a major step forward in the development
of practical tools for SDC. The main software developments in CASC are
μ-ARGUS, the software package for the disclosure control of microdata, and
τ-ARGUS for tabular data. Moreover, the CASC project has also resulted in a long
and impressive list of research papers. Some of this research has already been built
into ARGUS, while many other results will be implemented in future releases.

2 The CASC Project 

At several institutes in Europe, both universities and NSIs, isolated small groups
have been working on various aspects of SDC. The CASC initiative has brought these
small groups together and joined their forces. In this project, partly subsidised by
the European 5th Framework Programme, leading partners from 5 European countries
worked together. Both NSIs and universities participated in this project.



CASC-participating institutes

 1. Statistics Netherlands
 2. Istituto Nazionale di Statistica, IT
 3. University of Plymouth, UK
 4. Office for National Statistics, UK
 5. University of Southampton, UK
 6. The Victoria University of Manchester, UK
 7. Statistisches Bundesamt, D
 8. University of La Laguna, ES
 9. Institut d’Estadistica de Catalunya, IDESCAT, ES
10. Instituto Nacional de Estadistica, ES
11. TU Ilmenau, D
12. Institut d’Investigacio en Intelligencia Artificial, ES
13. Universitat Rovira i Virgili, ES
14. Universitat Politecnica de Catalunya, ES



As the CASC team became rather large, we formed a steering committee
representing the five countries (Statistics Netherlands, Istat-Italy, ONS-UK,
Destatis-Germany and Universitat Rovira i Virgili-Spain).

2.1 Microdata 



SDC for microdata is an emerging topic. As researchers at various institutes can
nowadays perform complex statistical analyses using the PCs on their desks, there
is strong pressure on the NSIs to make their rich databases available to them.
However, this cannot be done without adequate protection measures. The microdata
research in CASC has concentrated on two major topics: the development and
implementation of new protection methods and the estimation of the disclosure risk
per record.

A new framework for statistical disclosure control for business microdata has been
proposed, new masking techniques have been investigated and several
microaggregation methods have been developed. Different approaches to risk
estimation have been studied and a start has been made on their implementation
in μ-ARGUS.



2.2 Tabular Data 

The statistical disclosure control of tabular data has a longer tradition, as tables
are the more traditional output of the NSIs, and research has been carried out for
many years. Identifying the cells at risk is the easier part in this field. The
actual protection of the sensitive cells is the harder part. It requires the use of
very complex optimisation techniques to find good solutions. For this reason leading
experts in the field of operations research have joined the CASC team. Efficient
algorithms based on mathematical optimisation techniques have been developed and
implemented in τ-ARGUS. Nevertheless, due to the complexity and size of real-life
tables, it is sometimes not possible to calculate these optimal solutions, even on
modern computers. Therefore several good approximations have also been developed,
leading to safe tables in reasonable time. These include solutions based on breaking
down the large table into smaller sub-tables as well as heuristics based on network
flows.



2.3 Testing 

To achieve practical results that can be used in real-life situations, testing of the
results was considered an essential part of the CASC project. Both the testing of the
ARGUS twins and the testing of the methodology have received much attention in
this project. Several partners in the project have testing as their sole task. By
including them as partners in the project, testing has received the attention it needs.



3 μ-ARGUS

3.1 Introduction 

The roots of μ-ARGUS lie in a view of the safety/unsafety of microdata that has been
used at Statistics Netherlands for several years. The incentive to build the first
prototype of μ-ARGUS was to allow data protectors at Statistics Netherlands to apply
these general rules easily to various types of microdata, and to relieve them of the
time-consuming task of checking a large set of tables to find possible unsafe
combinations in a large microdata set. Not only should it be easy to produce safe
microdata, it should also be possible to generate documentation of the modifications
made to a microdata file.

The aim of statistical disclosure control is to limit the risk that sensitive informa- 
tion of individual respondents can be disclosed from data that are released to third 
party users. In case of a microdata set, i.e. a set of records containing information on 
individual respondents, such a disclosure of sensitive information of an individual 
respondent can occur after this respondent has been re-identified. That is, after it has 
been deduced which record corresponds to this particular individual. So, disclosure
control should aim to hamper the re-identification of individual respondents
represented in the data to be published.

An important concept in the theory of re-identification is a key. A key is a combi- 
nation of (potentially) identifying variables. An identifying variable, or an identifier, 
is a variable that may help an intruder re-identify an individual. Typically an identify- 
ing variable is one that describes a characteristic of a person that is observable, that is 
registered (identification numbers, etc.), or generally, that can be known to other 
persons. This, of course, is not very precise, and relies on one’s personal judgement. 
But once a variable has been declared identifying, it is usually a fairly mechanical
procedure to deal with it in μ-ARGUS.

Re-identification of an individual can take place when several values of so-called 
identifying variables, such as ‘Place of residence’, ‘Sex’ and ‘Occupation’, are taken 
into consideration. The values of these identifying variables can be assumed to be 
known to relatives, friends, acquaintances and colleagues of a respondent. When 
several values of these identifying variables are combined a respondent may be re- 
identified. See also Willenborg and De Waal (1996). 

3.2 Statistical Disclosure Control Measures 

To avoid re-identification several techniques are available in μ-ARGUS, such as
global recoding (grouping of categories), local suppression, the Post RAndomisation
Method (PRAM), adding noise, numerical microaggregation and qualitative
microaggregation.

Global Recoding

In global recoding several categories of a variable are collapsed into a single
one. The effect is that the number of records with the same value of a key rises
and the risk of re-identification diminishes. On the one hand, it is a very
powerful instrument in μ-ARGUS for making a data file safe: many unsafe keys will
disappear. On the other hand, a lot of detail can disappear as well. The data
protector should use this method carefully and also keep in mind that if the unsafe
keys cannot be resolved here, many local suppressions (i.e. imputed missing values)
will have to be applied. Researchers using the dataset might not like this prospect
and might prefer a more aggregated categorisation of a variable without all these
missing values.

It is important to realise that global recoding is applied to the whole data set, not 
only to the unsafe part of the set. This is done to obtain a uniform categorisation of 
each variable. 
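
The effect of global recoding on key frequencies can be illustrated with a minimal sketch. The records, variable names and category mapping below are invented for illustration, and the code is not taken from μ-ARGUS.

```python
from collections import Counter


def global_recode(records, variable, mapping):
    """Return a copy of all records with `variable` recoded through `mapping`.
    The recoding is applied to every record, not only to the unsafe ones."""
    return [{**r, variable: mapping.get(r[variable], r[variable])} for r in records]


def key_counts(records, key_vars):
    """Count how many records share each combination of values of the key."""
    return Counter(tuple(r[v] for v in key_vars) for r in records)


# Hypothetical microdata with two identifying variables.
records = [
    {"region": "North", "occupation": "surgeon"},
    {"region": "North", "occupation": "nurse"},
    {"region": "South", "occupation": "dentist"},
    {"region": "South", "occupation": "nurse"},
]
mapping = {"surgeon": "medical", "dentist": "medical", "nurse": "medical"}
recoded = global_recode(records, "occupation", mapping)

key = ["region", "occupation"]
print(min(key_counts(records, key).values()))  # every key is unique before recoding
print(min(key_counts(recoded, key).values()))  # keys are shared after recoding
```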

Local Suppression 

When local suppression is applied, one or more values in an unsafe combination are
suppressed, i.e. replaced by a missing value. This makes it impossible to use this
key any longer for re-identification. As keys often consist of several variables,
there is some freedom in selecting one of them for local suppression. Several unsafe
keys can also be found in one record. μ-ARGUS offers two methods to do this
efficiently. One is based on minimising the reduction of the entropy (i.e. preserving
as much of the information as possible); the other takes into account priorities set
by the user.
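
A priority-based choice of the value to suppress could look like the sketch below. It is only an illustration under assumed conventions (a higher priority number means "suppress this variable first"); the actual selection procedures in μ-ARGUS are not described here, and all names are hypothetical.

```python
MISSING = None  # marker used here for a suppressed (missing) value


def suppress_by_priority(record, unsafe_keys, priority):
    """Break every unsafe key in one record by blanking a single variable per key.
    `unsafe_keys` is a list of tuples of variable names; `priority` maps each
    variable to a number, the highest number being suppressed first (assumption)."""
    result = dict(record)
    for combo in unsafe_keys:
        if any(result[v] is MISSING for v in combo):
            continue  # this key was already broken by an earlier suppression
        victim = max(combo, key=lambda v: priority[v])
        result[victim] = MISSING
    return result


record = {"region": "North", "sex": "F", "occupation": "surgeon"}
unsafe_keys = [("region", "occupation"), ("sex", "occupation")]
priority = {"region": 1, "sex": 1, "occupation": 3}
print(suppress_by_priority(record, unsafe_keys, priority))
```

In this example one suppression of `occupation` breaks both unsafe keys, which is why selecting the variable carefully can limit the number of missing values.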

Top and Bottom Coding 

Global recoding is a technique that can be applied to general categorical variables, i.e. 
without any requirement of the type. In case of ordinal categorical variables one can 
apply a particular global recoding technique namely top coding (for the larger values) 
or bottom coding (for the smaller values). When, for instance, top coding is applied to 
an ordinal variable, the top categories are lumped together to form a new category. 
Bottom coding is similar, except that it applies to the smallest values instead of the
largest. Top and bottom coding for categorical variables can be seen as a special
case of global recoding.

Top and bottom coding can also be applied to continuous variables. What is im- 
portant is that the values of such a variable can be linearly ordered. It is possible to 
calculate threshold values and lump all values larger than this value together (in case 
of top coding) or all smaller values (in case of bottom coding). Checking whether the 
top (or bottom) category is large enough is also feasible. 
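
For a continuous variable, top and bottom coding can be sketched as follows. Representing the lumped category by the threshold value itself is an assumption made for this example; other representations (a label or an interval) are equally possible.

```python
def top_code(values, threshold):
    """Lump all values above `threshold` together, represented by the threshold."""
    return [min(v, threshold) for v in values]


def bottom_code(values, threshold):
    """Lump all values below `threshold` together, represented by the threshold."""
    return [max(v, threshold) for v in values]


def top_category_size(values, threshold):
    """Number of records that end up in the top category."""
    return sum(1 for v in values if v >= threshold)


incomes = [12, 18, 25, 40, 95, 300]  # hypothetical values
print(top_code(incomes, 40))
print(bottom_code(incomes, 18))
print(top_category_size(incomes, 40))
```

The last function corresponds to the check whether the top category is large enough.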

The Post RAndomisation Method (PRAM) 

PRAM is a disclosure control technique that can be applied to categorical data.
Basically, it is a form of deliberate misclassification using a known probability
mechanism. Applying PRAM means that for each record in a microdata file, the score on
one or more categorical variables may be changed. This is done independently of the
other records, using a predetermined probability mechanism. Hence the original file is
perturbed, so it will be difficult for an intruder to identify records (with certainty) as
corresponding to certain individuals in the population. Since the probability
mechanism that is used when applying PRAM is known, characteristics of the (latent)
true data can still be estimated from the perturbed data file. See De Wolf et al. (1998).
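
The probability mechanism can be thought of as a matrix of transition probabilities between categories. The sketch below applies such a matrix record by record; the categories and probabilities are invented for illustration and the function is not the μ-ARGUS implementation.

```python
import random


def pram(values, transition, rng):
    """Replace each value independently of the others, drawing the new category
    from the row of `transition` that belongs to the old category."""
    result = []
    for old in values:
        categories = list(transition[old])
        weights = [transition[old][c] for c in categories]
        result.append(rng.choices(categories, weights=weights, k=1)[0])
    return result


# Hypothetical transition probabilities; each row sums to 1.
transition = {
    "employed": {"employed": 0.9, "unemployed": 0.1},
    "unemployed": {"employed": 0.2, "unemployed": 0.8},
}
rng = random.Random(0)  # fixed seed so that a run can be repeated
original = ["employed"] * 8 + ["unemployed"] * 2
perturbed = pram(original, transition, rng)
print(perturbed)
```

Because the transition probabilities are known, an analyst can correct frequency tables computed from the perturbed file for the expected amount of misclassification.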

Microaggregation 

Microaggregation is a family of statistical disclosure control techniques for
quantitative (numeric) microdata, which belong to the substitution/perturbation
category. The rationale behind microaggregation is that confidentiality rules in use
allow publication of microdata sets if records correspond to groups of k or more
individuals, where no individual dominates (i.e. contributes too much to) the group
and k is a threshold value. Strict application of such confidentiality rules leads to
replacing individual values with values computed on small aggregates
(microaggregates) prior to publication. This is the basic principle of
microaggregation.

The method for multivariate fixed-size microaggregation implemented in μ-ARGUS
tries to form homogeneous groups of records by taking into account the distances
between the records themselves and between the records and the average of all
records in the data set; this method is called MDAV (multivariate microaggregation
based on Maximum Distance to Average Vector).
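
The basic principle can be shown with a deliberately simplified sketch: univariate fixed-size microaggregation, in which the values are sorted, split into consecutive groups of at least k records and replaced by the group mean. This is not MDAV, which groups records on several variables at once; the data are invented.

```python
def microaggregate(values, k):
    """Univariate fixed-size microaggregation: sort the values, form consecutive
    groups of k records (the last group absorbs the remainder, so it holds
    between k and 2k-1 records) and replace each value by its group mean."""
    if k < 1 or len(values) < k:
        raise ValueError("need at least k records and k >= 1")
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    result = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        end = (g + 1) * k if g < n_groups - 1 else len(values)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


turnover = [10, 200, 12, 11, 210, 190, 500]  # hypothetical business data
print(microaggregate(turnover, 3))
```

Every published value is then shared by at least k records, at the cost of losing the variation within each group.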

Risk Models 

To be able to distinguish safe from unsafe microdata, it is necessary that a disclosure
risk model is specified. Disclosure risk models can differ greatly in their degree of
sophistication. The basic model in μ-ARGUS is a fairly simple one, based on a
thresholding rule. The understanding is that a combination of values is
safe only if the (estimated) frequency of its occurrence in the population (or in the
file) is above a certain threshold value.
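
A minimal sketch of such a thresholding rule, applied to frequencies in the file rather than to estimated population frequencies, is given below. The records and the threshold are hypothetical.

```python
from collections import Counter


def unsafe_combinations(records, key_vars, threshold):
    """Return the combinations of key values whose frequency in the file is not
    above `threshold`, together with their frequencies."""
    counts = Counter(tuple(r[v] for v in key_vars) for r in records)
    return {combo: n for combo, n in counts.items() if n <= threshold}


records = [
    {"residence": "Urk", "sex": "F"},
    {"residence": "Urk", "sex": "F"},
    {"residence": "Urk", "sex": "F"},
    {"residence": "Urk", "sex": "M"},
]
print(unsafe_combinations(records, ["residence", "sex"], threshold=1))
```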

An individual risk measure allows one to estimate the chance of identification of
each record in the released file on the basis of the actual values observed on the
public variables. In the last few years a number of proposals have been
made. Benedetti and Franconi (1998) propose a methodology for individual risk
estimation based on the sampling weights, which is the approach used in this version
of μ-ARGUS.

3.3 μ-ARGUS Software

All the methods mentioned above have been implemented in the current version of
μ-ARGUS. We will continue to extend and improve μ-ARGUS, as our goal is to
make all the SDC methodology easily available to data protectors. However, it
must be stressed that this software tool can only be applied by people with a basic
understanding of SDC theory. μ-ARGUS is not a ‘black box’ that will automatically
produce a safe file.

The methods available in μ-ARGUS can be used to produce data files for different
purposes. We make a basic distinction between data files that will be made available
to established researchers at universities etc. (possibly under a contract) and data
files that will be made available to the general public. It goes without saying that in
the latter case much stricter rules have to be applied.

Fig. 1. μ-ARGUS main window (overview of the unsafe combinations per variable).



For more information on the μ-ARGUS software we refer to the μ-ARGUS manual
(Hundepool et al., 2003).



4 τ-ARGUS

Tables have traditionally been the major form of output of NSIs. There is also a
longer tradition of studying the SDC aspects of tabular data. Even in moderately
sized tables there can be large disclosure risks. Take, e.g., a cell in a table where
there is only one contributor. The published cell value is clearly the contribution of
one respondent/enterprise. However, the situation is more complex than this, and the
protection of the unsafe cells in a table is an even more complex task.



4.1 Sensitive Cells in Magnitude Tables 

The well-known dominance rule is often used to find the sensitive cells in tables, i.e.
the cells that cannot be published as they might reveal information on individual
records. More particularly, this rule states that a cell of a table is unsafe for
publication if a few (n) major contributors to a cell are responsible for a certain
percentage (k) of the total of that cell. The idea behind this rule is that in that case
at least the major contributors themselves can determine with great precision the
contributions of the other contributors to that cell. Choices like n=3, k=70% or
n=2, k=80% are not uncommon, but τ-ARGUS allows users to specify their own choice.
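
The dominance rule can be written down in a few lines. The sketch below treats a cell as unsafe when the n largest contributions together exceed k percent of the cell total; whether the comparison is strict is an assumption, and the contributions are invented.

```python
def dominance_unsafe(contributions, n, k):
    """True if the n largest contributions make up more than k percent of the
    cell total (strict comparison assumed for this illustration)."""
    total = sum(contributions)
    if total <= 0:
        return False  # nothing to disclose in an empty cell
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return top_n * 100 > k * total


print(dominance_unsafe([700, 150, 100, 50], n=2, k=80))   # two firms hold 85%
print(dominance_unsafe([300, 250, 250, 200], n=2, k=80))  # two firms hold 55%
```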




As an alternative the prior-posterior rule has been proposed. The basic idea is that
a contributor to a cell has a better chance of estimating the contributions of
competitors in that cell than an outsider, and also that this kind of intrusion can
occur rather often. The precision with which a competitor can make such an estimate
is a measure of the sensitivity of a cell. The worst case is that the second largest
contributor attempts to estimate the largest contributor. If this estimate is accurate
to within p% the cell is considered unsafe. An extension is that global knowledge
about each cell is also taken into account. In that case we assume that each intruder
already knows the value of each contributor to within q%.
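
A common textbook formulation of this rule is sketched below; it is an assumption that τ-ARGUS uses exactly this form. The second largest contributor subtracts its own value from the cell total, so its error in estimating the largest contribution x1 equals the sum of the remaining contributions. With prior knowledge of q%, only q% of that remainder is still uncertain. The cell is unsafe if this uncertainty is below p% of x1.

```python
def prior_posterior_unsafe(contributions, p, q=100):
    """(p, q) rule in a common textbook form (assumed here): unsafe if
    q% of the contributions other than the two largest is less than p% of the
    largest contribution. With q = 100 this reduces to the p% rule."""
    ordered = sorted(contributions, reverse=True)
    if not ordered:
        return False  # empty cell, nothing to disclose
    if len(ordered) == 1:
        return True   # the cell value is the single contribution
    largest, second = ordered[0], ordered[1]
    remainder = sum(ordered) - largest - second
    return q * remainder < p * largest


print(prior_posterior_unsafe([800, 150, 30, 20], p=10))        # remainder 50 < 80
print(prior_posterior_unsafe([800, 150, 60, 40], p=10))        # remainder 100 >= 80
print(prior_posterior_unsafe([800, 150, 60, 40], p=10, q=50))  # 50 < 80
```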

In addition to these rules a minimum frequency threshold can be applied.

Holdings play a special role here. In fact the contributions of a holding to each
cell, including the marginal cells, should be considered as one contribution when
applying sensitivity rules. τ-ARGUS has provisions for this.

We have also extended the sensitivity rules to the case of data files from samples.
To extend the flexibility we also allow for two sets of parameters for these
sensitivity rules. This is common practice in several countries.

Internationally and also in the Netherlands there is a shift from the dominance rule 
towards the prior-posterior rule. The reasons for this are the more intuitive approach 
and the better numerical properties like the protection levels. Also waivers (contribu- 
tors giving permission to publish their results) can be taken into account more easily. 
See Loeve (2001). 

With these rules as a starting point it is easy to identify the sensitive cells, provided
that the tabulation package can calculate not only the cell totals but also the number
of contributors and the n individual contributions of the major contributors. With
this information τ-ARGUS can apply the sensitivity rules, perform the table redesign
very easily and protect the table. τ-ARGUS produces the tables using the microdata
files as a source. It also calculates the necessary additional information (largest n
contributors etc.). Originally τ-ARGUS could only read microdata files, but because
of the many requests we have added the functionality to read ready-made tables and
protect them as well.



4.2 Protecting Sensitive Cells 

A problem, however, arises when the marginals of the table are published as well. It
is no longer enough to just suppress the sensitive cells, as they can easily be
recalculated using the marginals. Even if it is not possible to recalculate the
suppressed cell exactly, it is possible to calculate an interval that contains the
suppressed cell. This is possible if some constraints are known to hold for the cell
values in a table. A commonly found constraint is that the cell values are all
nonnegative.

If the size of such an interval is rather small, then the suppressed cell can be
estimated rather precisely. This is not acceptable either. Therefore it is necessary to
suppress additional information so that the intervals are sufficiently large.
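
The simplest case is a single row with a published total and nonnegative cells. The sketch below computes the interval an intruder can derive for a suppressed cell in such a row; real tables contain many linked rows and columns and need linear programming instead. The figures are invented.

```python
def suppression_interval(row, total):
    """Bounds for any single suppressed cell (marked None) in one row of
    nonnegative cells whose total is published."""
    n_suppressed = sum(1 for v in row if v is None)
    if n_suppressed == 0:
        raise ValueError("no suppressed cell in this row")
    slack = total - sum(v for v in row if v is not None)
    if n_suppressed == 1:
        return (slack, slack)  # the cell can be recalculated exactly
    return (0, slack)          # only nonnegativity and the total constrain it


print(suppression_interval([20, None, 35, 15], total=100))    # exact disclosure
print(suppression_interval([20, None, 35, None], total=100))  # a real interval
```

With one suppression the interval collapses to a point, which is why additional (secondary) suppressions are needed.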

Several solutions are available to protect the information of the sensitive cells: 

• Combining categories of the spanning variables (table redesign). Larger cells tend 
to protect the information about the individual contributors better. 

• Suppression of additional (secondary) cells to prevent the recalculation of the sen- 
sitive (primary) cells. 

The calculation of the optimal set (with respect to the loss of information) of
secondary cells that guarantees the required protection intervals is a complex
OR problem. τ-ARGUS has been built around this solution and takes care of the
whole process.

Table Redesign 

If there are many unsafe cells in a table, it is to be expected that even more
(secondary) suppressions are needed to achieve a safe table. A good alternative could
be to redesign the table by grouping categories together. The resulting table will
have fewer unsafe cells to be protected, so the protected table could be more
informative and useful. It is up to the data protector to find a balance between
table redesign and suppression.

Secondary Cell Suppression 

Once the sensitive cells in a table have been identified and there are not too many of
these cells, it might be a good idea to suppress these values. If no constraints on
the possible values in the cells of a table exist, this is easy: one simply removes the
cell values concerned and the problem is solved. In practice, however, this situation
hardly ever occurs. Instead one has constraints on the values in the cells due to the
presence of marginals and lower bounds for the cell values (typically 0). The problem
then is to find additional cells that should be suppressed in order to protect the
sensitive cells. The additional cells should be chosen in such a way that the interval
of possible values for each sensitive cell value is sufficiently large. Several
sensitivity rules imply what sufficiently large is, but for some other rules the data
protector has to specify the protection intervals.

In general the secondary cell suppression problem turns out to be a very hard prob- 
lem, provided the aim is to retain as much information in the table as possible, which, 
of course, is a quite natural requirement. The optimisation problems that will then 
result are quite hard to solve and require expert knowledge in the area of combinato- 
rial optimisation. 

Information Loss in Terms of Cell Costs 

In case of secondary cell suppression it is possible that a data protector might want to 
differentiate between the candidate cells for secondary suppression. By specifying a 
cost-function he can influence the choice of the secondary suppressions. The cell 
value is a possibility, but also the cell frequency could be chosen, or any other vari- 
able in the datafile. Also a power transformation of the cost-function is possible. The 
aim of secondary cell suppression can be summarised by saying that a safe table 
should be produced from an unsafe one, by minimising the information loss, ex- 
pressed as the sum of the costs associated with the cells that have secondarily been 
suppressed. 
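
The role of the cost function can be sketched as follows. The cell attributes `value` and `freq` and the form of the power transformation (the cost variable raised to a power) are assumptions made for this illustration.

```python
def cell_cost(cell, variable="value", power=1.0):
    """Cost of suppressing one cell: the chosen cost variable raised to `power`.
    A power of 0 gives every cell the same cost of 1."""
    return cell[variable] ** power


def information_loss(cells, suppressed, variable="value", power=1.0):
    """Sum of the costs of the secondarily suppressed cells."""
    return sum(cell_cost(cells[name], variable, power) for name in suppressed)


# Hypothetical candidate cells for secondary suppression.
cells = {
    "A": {"value": 400, "freq": 4},
    "B": {"value": 100, "freq": 25},
}
print(information_loss(cells, ["A"], variable="value"))  # cost by cell value
print(information_loss(cells, ["A"], variable="freq"))   # cost by cell frequency
print(information_loss(cells, ["A", "B"], power=0))      # number of suppressed cells
```

Depending on the cost variable, cell A is either the more expensive or the cheaper cell to suppress, which is how the data protector steers the choice of secondary suppressions.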

4.3 Solving the Secondary Cell Suppression Problem 

Several approaches to solving this problem have been implemented in τ-ARGUS, each
with its own characteristics, advantages and disadvantages:

• The hypercube method (GHMiter) 

• The optimal solution 

• The modular optimal solution 

• The network solution 

The Hypercube Method 

The approach builds on the fact that a suppressed cell in a simple n-dimensional table 
without substructures cannot be disclosed exactly if that cell is contained in a pattern 
of suppressed, nonzero cells, forming the corner points of a hypercube. 

The algorithm subdivides n-dimensional tables with hierarchical structure into a 
set of n-dimensional sub-tables without substructure. These sub-tables are then pro- 
tected successively in an iterative procedure that starts from the highest level. Succes- 
sively, for each primary suppression in the current sub-table, all possible hypercubes 
with this cell as one of the corner points are constructed. 

For each hypercube, a lower bound is calculated for the width of the suppression
interval for the primary suppression that would result from the suppression of all
corner points of the particular hypercube. To compute that bound, it is not necessary
to solve the time-consuming Linear Programming problem. If the bound turns out to be
sufficiently large, the hypercube becomes a feasible solution. For each of the feasible
hypercubes, the loss of information associated with the suppression of its corner
points is calculated. The particular hypercube that leads to minimum information loss
is selected, and all its corner points are suppressed. See Giessing and Repsilber (2002).
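
For a two-dimensional table a hypercube is simply a rectangle, and the selection step can be sketched as follows. This is a simplified illustration under stated assumptions, not the GHMiter code: the table has no margins or substructure, the function name `best_rectangle` is hypothetical, and the interval-width bound is represented only by a placeholder `is_feasible` that accepts every candidate.

```python
from itertools import product


def best_rectangle(table, cost, primary, is_feasible=lambda corners: True):
    """Return (loss, corners) of the cheapest rectangle with `primary` as a corner.

    table, cost : lists of rows with cell values and cell costs (same shape)
    primary     : (row, column) of the primary suppression
    is_feasible : placeholder for the bound check on the suppression interval
                  (assumption: always True here)
    Returns None if no rectangle with nonzero corner cells exists.
    """
    r0, c0 = primary
    best = None
    for r1, c1 in product(range(len(table)), range(len(table[0]))):
        if r1 == r0 or c1 == c0:
            continue  # a rectangle needs a different row and a different column
        corners = [(r0, c0), (r0, c1), (r1, c0), (r1, c1)]
        if any(table[r][c] == 0 for r, c in corners):
            continue  # corner points must be nonzero cells
        if not is_feasible(corners):
            continue
        # Information loss: costs of the three additional suppressions.
        loss = sum(cost[r][c] for r, c in corners[1:])
        if best is None or loss < best[0]:
            best = (loss, corners)
    return best


table = [[20, 50, 10],
         [8, 19, 22],
         [17, 32, 12]]
cost = table  # use the cell value as the cost
result = best_rectangle(table, cost, primary=(0, 0))
# result == (39, [(0, 0), (0, 2), (2, 0), (2, 2)])
```

In a real application the feasibility check would reject rectangles whose corner values are too small to give the primary cell a sufficiently wide suppression interval.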

An implementation of this method by R. D. Repsilber of the Landesamt für
Datenverarbeitung und Statistik in Nordrhein-Westfalen, Germany, offers a quick
heuristic solution. The method has been implemented in τ-ARGUS. Its advantages are the
speed of the solution, even for very large tables, and the fact that, unlike the other
solutions, it does not require a licence for commercial OR software. A disadvantage
is that the solution may not be the optimal one, leading to over-suppression.

The Optimal Solution 

J.J. Salazar (1998) has developed complex optimisation models to find the optimal
solution for secondary cell suppression. The models take into account the primary
cells to be protected and also ensure that the suppressed cells cannot be recalculated
to within given upper and lower protection levels. These models have the flexibility to
allow for different optimisation criteria, so it is possible to minimise the sum of the
values of the cells to be suppressed, the sum of the frequencies of those cells, or merely
the number of cells to be suppressed.
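
One common way to write such a model is sketched below. The notation is introduced here for illustration and is an assumption, not taken from Salazar (1998): $a_i$ are the cell values, $w_i$ the cell costs, $y_i$ indicates whether cell $i$ is suppressed, $P$ is the set of primary cells, $Mx = b$ collects the additive relations of the table, $l_i$ and $u_i$ are the bounds an intruder is assumed to know a priori, and $\mathrm{UPL}_p$, $\mathrm{LPL}_p$ are the upper and lower protection levels.

```latex
\begin{align*}
\min_{y \in \{0,1\}^n} \quad & \sum_{i=1}^{n} w_i \, y_i \\
\text{s.t.} \quad & y_p = 1, && p \in P, \\
& \overline{x}_p(y) \ge a_p + \mathrm{UPL}_p, && p \in P, \\
& \underline{x}_p(y) \le a_p - \mathrm{LPL}_p, && p \in P,
\end{align*}
where
\[
\overline{x}_p(y) = \max \{\, x_p : Mx = b,\;
  x_i = a_i \text{ if } y_i = 0,\;
  l_i \le x_i \le u_i \text{ if } y_i = 1 \,\}
\]
and $\underline{x}_p(y)$ is the corresponding minimum.
```

Setting $w_i = a_i$ minimises the sum of the suppressed values, setting $w_i$ to the cell frequency minimises the sum of the frequencies, and setting $w_i = 1$ minimises the number of suppressed cells.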

The original Salazar models could only protect simple unstructured tables, but recently
the models and the implementation have been extended to hierarchical and
linked tables. Because of all the sub-totals present in these tables, the intruder has many
more options for recalculating suppressed cells, and so the optimisation models have
become much more complex.

For very large tables the computing time required to find the optimal solution
might be prohibitive, but alternatives are then available in τ-ARGUS. It is also useful
to be able to calculate the optimal solutions in order to benchmark the other solutions.

The solution of these problems requires high-performance OR solvers that are only
available commercially. In τ-ARGUS we have made provisions to solve the Salazar
models with either Xpress or Cplex, the two major solvers available.

The Modular Optimal Solution 

In real-life situations most tables of NSIs tend to have one or more hierarchical
spanning variables. As the original Salazar model could only handle non-hierarchical
tables, an approximation has been built which breaks down the large hierarchical table
into many unstructured sub-tables. This results in a whole tree of small sub-tables.
Starting at the top, this method then protects all these tables. As the suppression
pattern sometimes influences a higher level of the tree, a backtracking procedure is
carried out.

At the end of this procedure the whole table is protected. It proves to be a reasonably
quick procedure, which has enabled us to protect very large tables. See De Wolf
(2002).
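
The break-down step can be illustrated for a single hierarchical spanning variable. The sketch below only enumerates the unstructured sub-tables from the top of the tree downwards; the protection of each sub-table and the backtracking are not implemented. The dictionary layout and the name `subtables` are assumptions made for this example.

```python
def subtables(hierarchy, root):
    """Yield (total, children) for every non-leaf code, top level first.

    hierarchy : dict mapping a code to the list of its direct sub-codes
    Each yielded pair describes one unstructured sub-table: a total
    together with the categories that add up to it.
    """
    queue = [root]
    while queue:
        node = queue.pop(0)  # breadth-first: higher levels come first
        children = hierarchy.get(node, [])
        if children:
            yield node, children
            queue.extend(children)


# Hypothetical regional hierarchy.
hierarchy = {
    "Total": ["North", "South"],
    "North": ["N1", "N2"],
    "South": ["S1", "S2", "S3"],
}
tree = list(subtables(hierarchy, "Total"))
# tree == [("Total", ["North", "South"]),
#          ("North", ["N1", "N2"]),
#          ("South", ["S1", "S2", "S3"])]
```

With several hierarchical variables the sub-tables would be formed from combinations of such (total, children) pairs, one per variable.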

The Network Solution 

Networks are often used in optimisation problems as an approximation of the full
optimal solution. The advantages are that the solutions are obtained rather quickly,
often at high quality. Networks have therefore been studied in the SDC area for a
long time. However, the conclusion was that networks could only be used properly
for 2-dimensional tables. At first sight this might seem a serious drawback, but
many very large tables produced by the NSIs are 2-dimensional, e.g. the foreign
trade statistics.

Jordi Castro (2003) has therefore developed a network-based solution, which is now
available in τ-ARGUS. The first implementation only allowed for non-hierarchical
tables, but now very large two-dimensional tables with one hierarchy can also be
solved.



4.4 The τ-ARGUS Software



All the above-mentioned solutions have been built into τ-ARGUS. The aim is to make
τ-ARGUS a control centre for tabular SDC. This will help users to apply the most
appropriate method available for the problem they face. Like μ-ARGUS, τ-ARGUS is
not a black box that will simply protect a table for you. τ-ARGUS is a control centre
that helps you to apply the appropriate SDC measures and performs the complex
computations involved.

For more information on the τ-ARGUS software we refer to the τ-ARGUS manual
(Hundepool et al., 2002).
Fig. 2. τ-ARGUS table with unsafe cells.

5 Conclusions

The CASC project has been a major step forward in the research on SDC and also in
the development of practical tools for SDC, i.e. the ARGUS twins. The CASC
project has shown that NSIs and universities can work together in a fruitful way,
achieving practical solutions. Although much progress has been made, we have not
yet reached the final destination. We hope that we can continue to cooperate in a
future project, as the new developments in information technology imply new
challenges in SDC as well.
Abstract. The right to informational self-determination and the freedom of science
are basic rights in the German Basic Law (the constitution). Both are very
important, so the German government is called upon to satisfy both needs. It is
shown here how a well-functioning informational infrastructure, especially for
providing German official microdata to empirical science, is developed within
the German legal situation without violating individual rights. Different German
official statistics microdata products for empirical research are introduced
by explaining the methods of anonymisation and the related provisions of the German
Statistics Law.



1 Introduction: Data Protection and Data Exploration 

The protection of privacy, and especially of personal and corporate data, is always a
topic of great public interest. Ensuring sufficient data protection seems to become
more and more difficult, especially since the internet offers many resources for
disclosing private data. On the other hand, there is a growing demand for microdata in
empirical science to explore the various social issues of modern society.

At one time it was considered sufficient for data users to work with aggregated
data such as tables and indices released by the statistical offices. But the accelerating
change of society, and the increasing number of new societal questions resulting from
it, changed the scientific interest, and aggregated data were no longer adequate to
answer research questions to a satisfying degree. This led the national
statistical organisations, and of course also the government, to think about the
possibilities and restrictions of publishing official microdata for research purposes.
In 1999 the Federal Ministry of Education and Research (BMBF)¹ established a
Commission for the Improvement of the Informational Infrastructure (KVI)².

The mandate of this commission was to review the informational infrastructure
between empirical researchers and official statistics and to work out new
concepts for the exchange of data for research purposes between data producers and
the scientific community. The KVI worked out a number of recommendations, which are
described in detail in its final report³.



¹ Bundesministerium für Bildung und Forschung - BMBF.

² Kommission zur Verbesserung der Informationellen Infrastruktur - KVI.

³ Kommission zur Verbesserung der informationellen Infrastruktur zwischen Wissenschaft und
Statistik (Hrsg.) “Wege zu einer besseren informationellen Infrastruktur”, Baden-Baden 2001.

J. Domingo-Ferrer and V. Torra (Eds.): PSD 2004, LNCS 3050, pp. 336-342, 2004. 

© Springer-Verlag Berlin Heidelberg 2004 
In Germany the importance of data protection has been emphasised above all since the
population census debate and the demonstrations associated with it. That pressure led to
the population census judgement and, eventually, to an amendment of the German federal
statistics law in 1987.⁴

The population census judgement⁵ underlined the protection of the individual and
the individual's right of self-determination in deciding whether or not to provide
personal data. Finally, it led to a well-defined legal system of data access in Germany,
which, in combination with the funding of Research Data Centres, made it possible to
provide regulated access to official German microdata. We now take a brief look at the
German federal statistics law, which in combination with the constitution and the data
protection law is the legal basis for microdata access in German official statistics.



2 German Federal Statistics Law 

The first legal regulation for the use of official microdata was made in the federal
statistics law in its version of 1981. It allowed the delivery of completely anonymised
microdata in §11(5) BStatG 1981. This of course left many restrictions, as complete
anonymisation always goes along with an enormous loss of information. Nevertheless,
this law brought about an epoch-making change, as it offered the first legal opportunity
for official statistics to release so-called Public Use Files (PUF): absolutely
anonymised microdata sets of official statistics. It also showed a way to proceed.
But because of the enormous loss of information, the files were not very popular for
empirical research. Public Use Files are still the only German microdata that can be
purchased from abroad.

A more satisfying solution for empirical researchers was the next legal improvement:
the federal statistics law in its version of 1987, especially §16(6) BStatG. It
introduced the “Privilege of Science”, which means that from that point on researchers
from independent institutions for scientific research were allowed to receive de-facto
anonymised microdata for clearly defined individual research projects. The Scientific
Use File (SUF) was born.

Scientific Use Files are de-facto anonymised microdata. This means that the risk of
disclosure cannot be excluded absolutely, but disclosure is only possible for a potential
attacker by investing a disproportionately high amount of time, cost and labour⁶.

The advantage of the highly regulated and strictly controlled means of data access 
is that there are clear and understandable norms, making data access democratically 
possible for every researcher under the same rules, independent of financial or other 
influence. 

The disadvantage is that under the current legal interpretation foreign researchers are
still not treated equally to German researchers. But as shown below, the German
Research Data Centres offer new ways which also enable foreign researchers to explore
German microdata to a satisfying degree.



⁴ http://bundesrecht.juris.de/bundesrecht/bstatg_1987/gesamt.pdf

⁵ From the guiding principles of the ruling of the First Senate of the Federal Constitutional
Court of December 15th, 1983, concerning the population census (population census judgement 1983)
(http://www.datenschutzzentrum.de/material/recht/leitsatz/leitsatz.htm).

⁶ http://bundesrecht.juris.de/bundesrecht/bstatg_1987/gesamt.pdf
Although a foreign researcher cannot purchase a German standardised Scientific
Use File, she can get a Public Use File and she can also use the data access routes of
controlled remote data processing and safe scientific workstations. The reasons for this
are explained in the relevant sections below. Still, the German law leaves a
lot to be desired for microdata access from abroad. Above all, a foreign researcher
should be treated equally to a domestic one.



3 Research Data Centres (RDC) of the German Official Statistics 

On October 1st, 2001, the first Research Data Centre of the German official statistics
opened its gates. The RDC of the Federal Statistical Office of Germany was established
in Wiesbaden. On April 1st, 2002, RDCs in the Statistical Offices of the Federal
States (Statistische Ämter der Länder) followed suit, establishing one location in
each federal state. The Research Data Centres offer many opportunities for microdata
access, such as Scientific and Public Use Files, onsite access by means of safe
scientific workstations (SSW), or controlled remote data processing (CRDP)⁷; they also
support universities with special microdata files for statistical training, the so-called
Campus Files.

The ultimate goal of the RDC is the further improvement of the informational in- 
frastructure between official statistics and the empirical science. The advice of the 
KVI is realised by the RDC with respect to the German legal situation and customized 
to researchers’ needs. How this is done, what advantages and disadvantages the Ger- 
man solution has, and what obstacles still have to be removed is described below by 
characterising the different ways for data access. 



4 Scientific Use Files and Public Use Files 

One possibility for microdata use is the purchase of a Scientific or Public Use File on 
CD-ROM. Different surveys are already available in that format. Scientific and Public 
Use Files are anonymised with different grades of anonymisation. 

The Public Use File leaves no way of drawing conclusions about single individuals
in the surveyed population. Its production is based on the rules in §11(5) of the German
Federal Statistics Law⁸, which essentially says that a data set is absolutely
anonymised if, as far as anyone can judge, no disclosure whatsoever is possible.
Absolutely anonymised data are created by, for example, aggregation or deletion of
single characteristics or drawing of sub-samples⁹.



⁷ A good description of the structure of the Research Data Centres of the German Official
Statistics can be found in Wende, Tom and Markus Zwick “Research Data Centres of the
Official Statistics”, published in United Nations Economic Commission for Europe and
Statistics Sweden (Eds.) “Statistical Confidentiality and Access to Microdata”,
p. 141-146, New York and Geneva 2003.

⁸ §11(5) BStatG 1981.

⁹ A good overview of anonymisation methods can be found in Köhler, Sabine: Anonymisierung von
Mikrodaten in der Bundesrepublik und ihre Nutzung - Ein Überblick. Forum der
Bundesstatistik 31, Statistisches Bundesamt, 1999, 133 - 144.
Standardised Scientific Use Files do not exclude the risk of disclosure, but the effort
of disclosure for a potential attacker is much higher than the benefit of disclosing
the de-facto anonymised data¹⁰. The anonymisation methods for SUF are basically
the same as for PUF, but they are applied less intensively.

By producing standardised SUF, and through some other activities described later,
it was possible to create a good database for social science research. The
special advantage of the released anonymised files is that researchers are able to
work with their own software on their own PCs; the disadvantage is the loss of
information resulting from anonymisation.

In the area of economic research it is more difficult to reach that level of convenience.
Business data anonymisation is very difficult because of the often small populations
underlying these data sets¹¹. For example, big companies in small regional areas
can hardly be protected from disclosure. That is why information-reducing anonymisation
such as deletion or sampling is not sufficient for business data. Methods that
perturb the real information in single cells, e.g. SAFE¹², are used for anonymising
business data. As one can imagine, the loss of information is huge, but an
interdisciplinary group of researchers and employees of the German FSO is working hard
to remedy this shortcoming¹³. Still, researchers of economic structures
are not satisfied with the present situation¹⁴. The topic of SDC for business data is so
wide that it goes beyond the scope of this paper. The short description here should
merely convey that there are already possibilities of working
with business data and other sensitive data by means of controlled remote data
processing or special data processing, which are described below.



5 Controlled Remote Data Processing (CRDP) 
and Special Data Processing (SDP) 



If a researcher needs more information than a Public or Scientific Use File can offer,
or if there is no standardised SUF or PUF available yet for a certain survey, there are
ways to work with less anonymised or merely formally anonymised data via the Research
Data Centres. One way is to work in a first step with the anonymised data set, for
example a standardised Scientific or Public Use File - or, if a SUF is not available,
with a so-called structural data set, which corresponds to the original data set in all
structural



¹⁰ §16(6) BStatG 1987.

¹¹ Sturm, Roland “Anonymisation of economic microdata” METHODS - APPROACHES -
DEVELOPMENTS - Information of the German Federal Statistical Office Number 2/2001
(http://www.destatis.de/download/mve/mad2_2001.pdf).

¹² cf. Höhne, Jörg “SAFE - A METHOD FOR STATISTICAL DISCLOSURE LIMITATION
OF MICRODATA”
(http://www.unece.org/stats/documents/2003/04/confidentiality/wp.37.e.pdf).

¹³ e.g. Sturm, Roland: Wirtschaftsstatistische Einzeldaten für die Wissenschaft. Wirtschaft und
Statistik 2, 2002, 101 - 109.

¹⁴ e.g. Hauser, Richard/ Wager, Gerd/ Zimmermann, Klaus “Erfolgsbedingungen empirischer
Wirtschaftsforschung und empirisch gestützter wirtschafts- und sozialpolitischer Beratung:
Ein Memorandum”, Allgemeines Statistisches Archiv, Band 82, S. 369-379.
attributes but not in content attributes - and in a second step to send the resulting
syntax for software such as SAS, SPSS or STATA back to the RDC, where it is run
under internal control on the original data. This is called controlled remote
data processing. Because the microdata never leave the secure area of the
statistical office and are never seen in their original form by the researcher, it is
also possible for foreign researchers to use this method of analysis. The reason is that
disclosure by matching additional information etc. is not possible without access to the
original data. With CRDP, every step of the research process is observed by
an employee of the RDC. An attempt at disclosure would be detected immediately.

A special form of controlled remote data processing is special data processing. In 
that special form a researcher describes his research interest to a representative of the 
statistical office and the representative does the empirical work. 

One advantage of controlled remote and special data processing for data confidentiality
is that the computing process remains under control and the representatives of
the Research Data Centre know exactly what information is given to the researcher.
The employee of the Federal Statistical Office is ultimately responsible for the
anonymity of the aggregated tables he releases. Another advantage is that the output
is not microdata but aggregated data, usually in the form of tables, which can be
anonymised more easily. The advantage for the researchers is that they can
make precise statements about the whole population with a lower standard error and,
in general, a low error variance. Further advantages are that the consulting function of
the Research Data Centres can be used and that it is possible to work with
company data, which was not possible before. Additionally, foreign researchers are given
a way to work with German official microdata which is equal to that of a German
researcher.

The disadvantages are that these processes mean a lot more work and cost for both 
the researchers and the representative of the official statistics, and as a result of this 
they need a lot more time. 



6 Onsite Data Access: Safe Scientific Workstation (SSW) 

Another new way of data access in the protected area of the German Statistical Offices
is the “safe scientific workstation (SSW)”. By working onsite at an SSW, a researcher
gets access to de-facto anonymised data. That means he can
access microdata by means of sealed-off computers in specially equipped research
labs in the statistical offices. The researcher can work with less anonymised
official microdata, which are called Onsite Scientific Use Files.

The system of the safe scientific workstation is also based on §16(6) BStatG 1987.
A special Scientific Use File is provided on a safe workstation inside the statistical
office. This way of data access can also be used by foreign researchers, because a
possible breach of confidentiality would happen on German ground and could therefore
be prosecuted under German law. By contrast, if a potential attacker from abroad were
able to purchase a standardised Scientific Use File, he could attempt disclosure in his
home country, where he cannot be prosecuted under German law.

The difference in anonymisation between an Onsite Scientific Use File and the
released standardised SUF, which is also de-facto anonymised, is that the anonymisation
criteria in the onsite case are lower because of other means of confidentiality
control, such as the fact that the visiting researcher is given no way of transferring
data, except for his aggregated output in the form of tables. Further, the log file of
his research work is saved and can be checked by an employee of the RDC. And employees of
the RDC are always nearby, because the safe scientific workstations are integrated into
the usual office area of the RDC.

An Onsite SUF is mostly tailor-made for the research issues of a visiting researcher,
but of course this is only possible if the disclosure risk is low.

The following short example shows how an Onsite SUF can be built up. The researcher
can download the metadata of the material he wants to work with from the
RDC homepage¹⁵ to prepare his work. Then he submits a list of the variables he
needs to the RDC. A member of the RDC and a member of the operating department
check whether all the required variables can be provided, and the former then
compiles the data set by dropping the unneeded and unavailable variables. Afterwards,
a sub-sample of 80 % is drawn from this material.
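
The two steps of this example, variable selection followed by sub-sampling, can be sketched as follows. This is only an illustration: the function name `build_onsite_suf`, the record layout, and the use of simple random sampling with a fixed seed are assumptions, since the text does not specify how the 80 % sub-sample is drawn.

```python
import random


def build_onsite_suf(records, requested, available, fraction=0.8, seed=0):
    """Keep the requested variables that are available, then draw a sub-sample.

    records   : list of dicts, one per respondent
    requested : variables the researcher asked for
    available : variables the operating department can provide
    fraction  : share of records kept (0.8 as in the example in the text)
    """
    keep = [v for v in requested if v in available]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n = round(fraction * len(records))
    sample = rng.sample(records, n)  # simple random sampling (assumption)
    return [{v: rec[v] for v in keep} for rec in sample]


# Ten made-up records with three variables.
records = [{"age": 20 + i, "income": 1000 * i, "region": "A"} for i in range(10)]
onsite = build_onsite_suf(
    records,
    requested=["age", "income", "phone"],
    available={"age", "income", "region"},
)
# len(onsite) == 8; every record contains only "age" and "income"
```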

The preceding sections have made clear that the new ways of data access offered by the
RDC guarantee confidentiality by means of well-balanced anonymisation. There
exists only one way of access to merely formally anonymised microdata: the cooperation
of a statistical office with a researcher in an individual research project initiated by
the statistical office, also known as a “One Dollar Man Contract”. This can be done if
there is a research interest which is primarily useful for official statistics and only
secondarily for an external researcher. It is then possible for a researcher to sign a
fixed-term employment contract (with the symbolic payment of one dollar) with a
statistical office and to work with microdata as an employee of that statistical office;
the researcher is therefore bound to confidentiality like every other employee of the
statistical offices. But this way of microdata access is an exception which only comes
about if a statistical office needs external help in an individual project. It is
mentioned here just to complete the menu of microdata access possibilities in Germany.



7 Campus File (CF) 

The latest innovation of the German RDC aims at improving practical statistical
knowledge at universities and schools: the so-called Campus File (CF).

The Campus File is a Public Use File, which means that it is absolutely anonymous
and offers no way of disclosure. It can be downloaded from the internet for free.

The difference between the CF and a usual PUF is that the Public Use File follows
the principle of giving as much information as possible with no risk of disclosure
as far as anyone can judge. It is still possible to use a Public Use File for research
purposes. That makes sense especially for researchers from abroad who cannot
purchase a Scientific Use File at the moment, but nevertheless want to work
with German official microdata. In contrast, the CF is created for statistical training
only. It does not necessarily provide valid statements about empirical reality.
This has the advantage that CFs can be produced very quickly and easily, because one
can choose a very high anonymisation level without worrying about the loss of validity.

The anonymisation of a Campus File works as follows:



¹⁵ http://www.forschungsdatenzentrum.de
The basic material of the Campus File is not the original complete data
set but the standardised and de-facto anonymised Scientific Use File. One can therefore
assume that the recoding of single variables which show low counts in the basic
population because of special characteristics (very high income or age, small region,
etc.) has already been done to a sufficient degree. Moreover, the SUF is already a
sub-sample of the original data set.

The Campus File of the Microcensus 1998, for example, which is already available, is a
5% sample of the households of the standardised Scientific Use File. A further
anonymisation step is the deletion of about half of the variables of the SUF. The
deleted variables are mainly the answers to voluntary questions and all sub-samples
included in the original data set. The result in this case is a file with about 200
variables, about 10,000 households and 25,000 persons.
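
Because the Campus File is a sample of households rather than of persons, all members of a selected household stay together. A minimal sketch of this step is given below; the identifier `hh_id`, the function name `campus_file` and the simple random selection of households are assumptions for illustration, not a description of the procedure actually used by the RDC.

```python
import random
from collections import defaultdict


def campus_file(persons, keep_vars, fraction=0.05, seed=0):
    """Draw a household sample from person records and keep selected variables.

    persons   : list of dicts, each with a household identifier "hh_id"
    keep_vars : variables retained in the Campus File (the rest are deleted)
    fraction  : share of households kept (5% in the Microcensus example)
    """
    households = defaultdict(list)
    for person in persons:
        households[person["hh_id"]].append(person)

    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    ids = sorted(households)
    n = max(1, round(fraction * len(ids)))
    chosen = rng.sample(ids, n)  # sample whole households, not persons

    return [
        {v: person[v] for v in ("hh_id", *keep_vars)}
        for hh in chosen
        for person in households[hh]
    ]


# 100 made-up households with two persons each.
persons = [
    {"hh_id": h, "age": 30 + i, "income": 1000, "voluntary_item": "x"}
    for h in range(100)
    for i in range(2)
]
cf = campus_file(persons, keep_vars=["age"])
# 5 households are selected, so cf contains 10 person records
```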

A School File, which is specially oriented towards the needs of school education, will
also be published this year. Above all, this file will have fewer cases to provide a
better overview for simpler educational tasks.



8 Projects in the Future 

Looking ahead, the Research Data Centres are working on the expansion of low-cost
microdata access in the form of standardised Scientific and Public Use Files. Above all,
the development of the Campus File for statistical education will be intensified. This
year, the CF family will grow with further CFs of the microcensus, the survey of
income and expenditure, the public assistance statistics, and the income tax statistics.

Furthermore, the RDC is keen to develop an online system of controlled
remote data processing, which will give certain registered users the ability to
work with official microdata via a client-server network from their own PC. There
will also be an improvement of the consultancy capacity for visiting researchers at safe
scientific workstations and for researchers who use CRDP or SDP.

The RDCs are working on the central availability of all official microdata and also 
on the elaboration of a widespread metadata-system for all official data. For that rea- 
son, a new project with the title “revaluation and storing of historical microdata” was 
started by the RDC in the year 2004. 

Last but not least the RDCs will engage in pressing the legislator to re-adjust the 
federal statistics law to the new challenges since 1987, to provide a well developed 
informational infrastructure, and to include issues like: foreign researchers’ access to 
German microdata, time as a factor of anonymity, and the matching of historical mi- 
crodata to panel-like data sets. 
Abstract. The development of new disclosure protection techniques is useful
only insofar as those techniques are adopted by statistical agencies. In order for
technical experts in disclosure limitation to be successful, they are likely to 
need to interact with the appropriate statistical offices. This paper discusses just 
such a successful interaction in the United States. It describes the foundation 
that three major U.S. agencies - the Census Bureau, the Social Security Ad- 
ministration, and the Internal Revenue Service - laid in order to develop more 
useful statistical products. These included a proposed synthetic data public-use 
file based on the confidential microdata from all three agencies. Since then 
other governmental organizations, such as the U.S. Congressional Budget Of- 
fice, have become involved with this inter-agency effort, which seeks to provide 
researchers and other users in the broader statistical community with a data util- 
ity often possible previously only with access to the confidential microdata. 
The confidentiality implications for all three agencies - and the potential for 
more - of a successful conclusion to this work would be enormously beneficial 
to data users, data producers, and data respondents. This paper describes the 
importance of developing the necessary framework, which includes an under- 
standing between statistical office decision makers and the technical experts, 
before beginning such an endeavor. It provides a description of how this effort 
even became possible, and uses the history of events and related lessons to de- 
scribe essentials that might be useful for other national statistical offices facing 
similar constraints and goals. 



1 Introduction 

The development of new disclosure protection techniques is useful only insofar as 
those techniques are adopted by statistical agencies. For technical experts in disclo- 
sure limitation to be successful, they are likely to need to interact with the appropriate 
statistical offices. 

This paper discusses just such a successful interaction in the United States. 

Since 2001 inter-agency efforts have been underway on a synthetic data approach 
to produce a public-use file (PUF), which would combine selected statistical and 
administrative data from three U.S. agencies: the Census Bureau’s Survey of Income 



The views expressed in this paper are the author’s and not necessarily those of the U.S. Inter- 
nal Revenue Service. 

J. Domingo-Ferrer and V. Torra (Eds.): PSD 2004, LNCS 3050, pp. 343-352, 2004. 
and Program Participation (SIPP), retirement and disability benefits data from the
Social Security Administration (SSA), and limited earnings data from tax records 
filed with the Internal Revenue Service (IRS). Based on progress so far, the outlook 
for this work is promising. The confidentiality and research benefits of this approach, 
if successful, could be substantial, but details of that technical discussion are left for 
other papers. 

It is important to note, however, that technological advances in disclosure protec- 
tion are necessary, but not sufficient, conditions for the adoption of new techniques. 
This paper focuses primarily on describing the evolution of the legal, institutional, 
and bureaucratic environment that was the critical precursor of the interagency effort. 
Out of the story come lessons that may help other national statistical offices cope with 
similar challenges. 

This story is largely a confluence of separate but related events:

• The development of an institutional interagency trust, after a serious test of the 
fundamental relationship; 

• The recognition by the Census Bureau of the deteriorating tradeoff between data 
quality and data protection in the release of previous SIPP public use files, which 
was influential in deciding to pursue the synthetic data PUF approach; and 

• The development of a new program (Longitudinal Employer-Household Dynam- 
ics) that brought in the technical know-how that permitted the integration of statis- 
tical and administrative data within the new program, and the creation of the 
aforementioned SIPP/SSA/IRS PUF. 

This paper focuses primarily on the first of these, but also notes the relevance of 
the other events. 



2 Background 

Statistical agencies have become increasingly aware that two relatively new chal- 
lenges may seriously affect their ability to release data into the public domain, 
whether in tabular or public-use file format. Increasing capabilities of computing 
power and advances in mathematical/statistical techniques have led to the increase in 
technical re-identification capacity. This challenge is matched by a practical increase 
in this capacity due to the proliferation of datasets in the public and pri- 
vate/commercial domain. In spite of these challenges, the need for publicly collected 
confidential data to inform decisions in both government and the private sector is not 
expected to abate. 

The U.S. tax administration agency, the Internal Revenue Service (IRS), faces ad- 
ditional challenges in its role as an important administrative data provider for the 
Federal statistical system. Tax data have always been particularly susceptible to re- 
identification, both because of their relatively widespread distribution in public form 
and because of their sensitive content. In addition, because publicly and privately 
available datasets are often directly based on entities also in the tax system, there is 
more potential to match to tax data and re-identify taxpayers. Moreover, IRS views 
the protection of taxpayer confidentiality as an essential component of successful 
voluntary tax compliance, upon which the tax system relies. Because of the several 
U.S. statistical agencies authorized to receive confidential tax data, IRS must not only 
preserve tax data confidentiality within its own administrative system, but also oversee 
the safeguarding of tax data in the systems of the recipient statistical agencies. In 
a related vein, IRS must ensure that the numerous products produced by each statisti- 
cal agency cannot be statistically “cross-matched” and thereby enable complementary 
disclosure of identifiable information. 

Because of these additional challenges, IRS must insist that its safeguarding stan- 
dards be met by a recipient statistical agency, regardless of the agency’s standards for 
data it collects directly. This requirement of compliance with administrative data 
provider standards also influenced the authorization process for statistical use of tax 
data by Census, as will be shown later, but this requirement may differ for other countries. 
For example, the United Kingdom’s Office for National Statistics stipulates that 
“the same confidentiality standards will apply to data derived from administrative 
sources as apply to those collected ... for statistical purposes.”¹ Nevertheless, the 
unmistakable conclusion is that it is becoming increasingly difficult to release even 
aggregate tabular data into the public domain, and public-use files (often of most use 
to researchers without access to the original source data) pose special challenges that 
are exacerbated over time in the public domain. Although closer coordination of all 
releases is advisable, new methods of confidentiality protection may afford the most 
hope for data users, data providers, and ultimately, the respondents themselves. 

While issues surrounding the disclosure of confidential data are common to all 
Federal statistical agencies, IRS also has its own idiosyncratic issues². Confidential 
tax data, also known as Federal Tax Information (FTI), have several uses, including 
specifically authorized statistical purposes. The homogeneous treatment of FTI re- 
sults from restrictions in the tax statute, the Internal Revenue Code (IRC), which do 
not allow IRS to distinguish among FTI data elements - even as to age. That is, there 
is no statute of limitations as there is for confidential microdata at statistical agencies 
such as the U.S. Census Bureau. In addition, the tax statute does not distinguish 
among different types of data or taxpayers, so that the Social Security Number of 
John Q. Citizen in Anywhere, USA, would receive the same protection as that of Bill 
Gates which, in turn, would be protected as much as all the financial information on 
any business tax return which Microsoft Corporation might file. Accordingly, all FTI 
- whether entity or tax module information³ - must be treated and protected in perpetuity 
as equally sensitive and confidential. This task of protecting confidentiality, 
given the ever-increasing amount of data for which IRS becomes responsible over 
time, is expensive and technically challenging. 



¹ P. 6, Working Paper No. 11, Contexts for the Development of a Data Access and 
Confidentiality Protocol for UK National Statistics, Joint ECE/Eurostat Work Session on 
Statistical Data Confidentiality, Luxembourg, 7-9 April 2003. 

² Confidential data are any identifiable data whose public release is unauthorized. The 
removal of identifier information, such as name, address, and identification numbers, is a 
necessary but insufficient condition to render such data anonymous or unidentifiable. 

³ An abbreviated course in IRS master files might summarize data maintained on these 
systems (whether individual or business master file) as being one of two types: entity 
information or tax module information. Entity information refers to information used to 
identify and locate a taxpayer such as Taxpayer Identification Number (Social Security 
Number - SSN, Employer Identification Number - EIN), Name, Address, and perhaps Industry 
Classification Code (NAICS or SIC-based) for a business. Everything else is tax module information. 
The tax law’s anonymity standard is indiscriminate and absolute in requiring that 
all tax data, whether business or individual, be released in anonymous form. The 
anonymity requirement for data publicly released by IRS also applies to statistical 
agencies authorized to receive FTI. However, although the general standard applies, 
the actual disclosure protection methodology is not specified. The requirement is 
simply that whatever methodology is used be either identical to that employed by IRS 
or else an equivalent approved by IRS. 

The practical question confronting any methodology attempting to meet the abso- 
lute anonymity standard is: From what sort of intrusion must the data be protected? 
Must it be absolutely impossible to re-identify a taxpayer using any means available, 
or is there some less rigid methodological standard? Traditionally, the answer has 
been that tax data must be protected from potential intruders who, using “reasonable 
means,” might attempt to make such a re-identification. Reasonable means include 
the use of reasonably available computer technology, mathematical/statistical tech- 
niques, and a working knowledge of the subject matter to which the data apply. The 
reasonable means standard is a good effort to keep the entire system from shutting 
down and being replaced by a policy of no data release at all - probably the only way 
to guarantee no re-identification. The problem, as can probably be imagined in 2004, 
is that the concept of reasonable means is a technology-relative concept and may be a 
moving target too elusive to be relevant for the absolute standard of anonymity. As a 
result, in a time of increasingly tight budgets protecting the confidentiality of tax data 
is becoming a task virtually impossible to execute successfully. 



3 Developing Interagency Trust 

3.1 A Breakdown in the Relationship 

In 1999 IRS began its mandated triennial safeguards review of a principal U.S. statistical 
agency, the Census Bureau. Although the U.S. statistical system is more decentralized 
than that of many European Union countries, Census receives the preponderance 
of confidential tax data for statistical purposes as a result of the statutory 
authorization conferred by section 6103(j)(1)(A) of Title 26 of the United States Code 
(USC). The implementing Income Tax Regulations specify both the actual items 
authorized for access and their access purpose or Title 13, Chapter 5, USC. 

The mandated IRS safeguards review of Census (and other recipient agencies of 
confidential tax data) is a result of the same section, 6103, which authorizes such 
access in the first place. As a result of the 1999 IRS safeguards review, deficiencies 
in the oversight process were uncovered by IRS, some of which reflected poorly on 
both Census and IRS. For example, Census used tax data for some projects which 
had not received explicit IRS approvals, but IRS had made explicitly clear neither the 
need for such approvals nor the process for effecting them in a coordinated fashion. 

As it became clear that neither Census nor IRS could resolve the resulting crisis, 
intervention at high levels of government became necessary. Eventually, the U.S. 
Office of Management and Budget (OMB), which has broad oversight responsibilities 
for Federal statistical agencies, helped broker an understanding between the two 
agencies based upon three essential points: 







(1) Census must comply with IRS safeguard standards in order to protect the con- 
fidentiality of tax data, 

(2) informed decisions by policy makers inside and outside government require the 
best possible data available, and 

(3) tax data are so important to these information decision systems that their exclu- 
sion is not a viable option. 

Thus, the conclusion of this process was that IRS, as an administrative data pro- 
vider, and Census, as an administrative data user, would have to find a way to make 
their relationship work in order to satisfy the several stakeholders involved; that is, an 
inter-agency “trainwreck” or shutdown was viewed as unacceptable and would not be 
tolerated. 

As a result, IRS and Census recognized that the increasingly murky and implicit 
boundaries within which their relationship had been struggling were inadequate as 
guidance. Further, a relationship was needed which would not only work but which 
would better accommodate the increasingly complex needs of the many end users. 
Essentially, the relationship needed to be not only re-evaluated but also recalibrated, 
especially to accommodate a new form of confidential data access created by Census 
for outside researchers meeting new Census study needs: the Research Data Center 
(RDC) consortium operated by its Center for Economic Studies. Like statistical 
agencies in other countries⁴, Census had realized the need to explore other venues for 
purposes of improving its statistical knowledge base and data quality, but only as a 
result of the IRS safeguards review did this realization include the need to integrate its 
RDCs into the overall process encompassing its other longstanding functions. 

To meet the need for new statistical research uses of FTI in particular, a clear and 
detailed understanding that met the mandates of both agencies needed to be documented. 
Accordingly, an IRS-Census policy agreement, Criteria for the Review and 
Approval of Census Projects that Use Federal Tax Information, better known as the 
Criteria Agreement, was mutually devised and eventually signed into effect by both 
agencies in September 2000. At the core of this agreement, available at 
www.ces.census.gov, was the understanding that any data use or access had to be 
authorized by an explicit approval process involving both the data provider, IRS, and 
the data user, Census, and that, especially for outside researcher access, the predomi- 
nant purpose of such access had to be the benefit of Census under its own statutory 
mandate; namely, Title 13, Chapter 5, United States Code. 

In effect, the Criteria Agreement established and refined not only the protocols, but 
most importantly, the authorization to fully legitimize Census use of confidential tax 
data. It was implicit in this agreement that exclusively statistical use was a necessary 
but insufficient condition for authorized access. Instead, an explicit approval by the 
data provider and user was required which attested to the access authorization under 
the statutes of both IRS and Census, the IRS implementing regulations, and the Cen- 
sus-IRS Criteria Agreement’s specific requirements in order to satisfy the record for a 
particular programmatic use. This point is worth emphasizing, as it was not enough 
that data provider and user agreed to the general imprimatur provided by the statutory 
and regulatory bases for proposed access by the user. Because the Census-RDC 



⁴ For example, see Working Paper No. 10, Research Data Centres of Official Statistics, Joint 
ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxembourg, 7-9 April 
2003. 
model was seen as at the vanguard, if not the frontier, of data access, it was especially 
important that the record explicitly demonstrate the data provider was convinced of 
the proposed statistical use’s justification. This type of specific dual approval is also 
necessary for another unique data access model with similar high visibility disclosure 
risk; namely, the public-use file. 

Implicit to this inter-agency relationship is the notion that the record of all actions 
taken must be able to demonstrate not only authorized intent but credibility - for 
some pending audience of critics. This inevitable, critical eye is known as third party 
scrutiny, and it is neither hypothetical nor irrelevant, instead consisting of both ex- 
plicit and implicit oversight bodies such as the U.S. Congress’ General Accounting 
Office, the U.S. Treasury Inspector General’s Office, privacy advocates, the media, 
and ultimately, the respondents themselves. In preparing for third party scrutiny the 
record underlying data access should credibly demonstrate that the process has antici- 
pated as many factual questions as possible and that it has also considered perceptions 
as well. Thus, the process needs to demonstrate consistently that it operates within 
not only the letter of the agreement but also its intent - so that accountability, authori- 
zation of the access granted, and purpose are never in doubt. To address both outside 
perceptions and the reality of third party scrutiny, Census and IRS agreed on the 
importance of exceeding the literal requirement of the agreement whenever possible. 
For this reason both agencies agreed that it would be a rare occasion demanding 
minimum adherence to predominant purpose as an acceptable criterion; that is, only 
over 50 percent of the access purpose. Consequently, approval on the margin would 
not be the rule, but the exception. 

Perceptions, in conjunction with concerns about third party scrutiny, played a large 
role in this need for dual explicit authorization by data provider and user, especially 
for outside researchers engaged by a national statistical agency such as Census. 
Again, it was vital that access of the provider’s administrative data not be construed 
as a type of unauthorized usage disassociated from or only loosely associated with the 
statistical user’s mandate and mission, especially when the resulting analytical data 
had the potential for affecting groups of respondents. Without explicit evidence (that 
is, the mutual approvals of both the administrative data provider and the statistical 
user signifying that the specific use was authorized), third party scrutiny might raise 
troubling questions as to the type of confidentiality protection assured by the adminis- 
trative data provider, which assumes virtually all risk with its respondent population. 
This issue goes to the heart of accountability in data stewardship. 

One reason for the IRS-Census impasse in 2000 is that there is a fundamental and 
inexorable tension due to the conflicting nature of their respective mandates. Census 
is mandated to use administrative data to the maximum extent possible in order to 
reduce respondent burden and processing costs. IRS is mandated to provide confidential 
tax information only to the minimum extent necessary. This inherent tension 
imposes a sort of de facto equilibrium in the intersection of the agencies’ confidentiality 
cultures, and only the strongest part of each culture is allowed relevance. This 
equilibrium is thereby critical to protecting confidentiality, and to guarding against perceptions of abuse, as both 
data provider and user must bargain hard for an acceptable access transaction that 
satisfies their respective mandates. Critical to such success is a set of clearly defined 
terms and processes, and the documentation of subsequent actions following such a 
process. Equally critical is the devotion of sufficient resources to ensure the needed 
safeguards. Because resources are finite, so must be the amount of access whose 
safeguarding can be demonstrably credible. Without resource commitment to verifiable 
standards of protection, the clear implication is that access can approach infinite 
levels, suggesting both an inability and a lack of commitment to safeguard the data 
effectively. 



3.2 Rebuilding the Relationship: Implementation of the Criteria Agreement 

It was clear at the inception of the Criteria Agreement that the many new proposals of 
the RDCs’ outside researchers would be tied to the Census Bureau’s future viability, 
especially its ability to keep up with the new statistical needs of decision makers. 
That is, the RDC project proposals were seen as critical to maintaining the statistical 
heartbeat at Census. 

In fact, most of the FTI access proposals came from Census RDCs, and initially, 
Census and IRS reviewed these proposals concurrently. This arrangement was soon 
abandoned for primarily one reason. Although it was inefficient for IRS, the adminis- 
trative data provider, to spend time reviewing proposals ultimately rejected by Cen- 
sus, it was critical that the fundamental criterion of all tax data access; that is, a pro- 
posal’s predominant purpose of benefiting Census under Title 13, Chapter 5, be 
demonstrated in proposals that Census, as data user, first approved. That is, the Cen- 
sus review process was supposed to consider not only scientific merit but also Title 
13, Chapter 5, predominant purpose, while IRS review considered only the latter. 
Once it became clear that Census needed to take responsibility for both aspects of 
review (although IRS, as data provider, maintained ultimate control as data owner) 
the human review capital, especially regarding requirements for tax data access, could 
be transferred upstream from IRS to Census, and then from Census to the researcher 
community. Thus, the confidentiality culture needed by the data provider to assuage 
its third party scrutiny concerns was necessarily integrated into the data user’s 
confidentiality culture as well as that of its researcher community. In turn, this culture 
colonized prospective researchers. 

Outside researchers realized they had two critical interests in helping such a system 
succeed. First, the perpetuation of the Census-IRS arrangement allowed the re- 
searcher community access to FTI for authorized purposes, which required undertak- 
ing only proposals within scope. Second, by learning the needed culture, researchers 
could help increase the probability of their own proposals being approved, and even 
increase the number of proposals which might be possible by, theoretically and ceteris 
paribus, shortening the review process itself. 

However, to counter the potential for insincere or even fraudulent researcher be- 
havior, IRS, as administrative data provider, and Census, as data user, also conveyed 
three fundamental understandings to the researcher community. First, cheating on 
proposal purpose would inevitably be self-defeating, as it would destroy the process. 
Thus, implicit, if not explicit, peer-policing among the researcher community was 
essential to the process succeeding, and was encouraged by both Census and IRS. In 
fact, both agencies took pains conveying directly to the researcher community that 
while it might be possible to deceive both agencies’ reviews, it would be at a cost 
fatal to the process. Second, a post-project certification process would be necessary 
not only to satisfy the potential dangers of third party scrutiny by completing the 
authorization process, but also to help increase the knowledge capital of the proposal 
process itself. Third, the entire process was dynamic and was likely to be re-evaluated 
whenever necessary, to ensure that practice kept up with the multiple needs 
of decision makers, which included not only adequate data but also confidentiality 
concerns and related perceptions. 

The notion of “Census benefit” may require some amplification, as it might differ 
from the statistical benefit required by other countries. For example, in the U.K.’s ten 
principles of protocol, access to confidential data is granted only “where it will 
[emphasis added] result in a significant statistical benefit”⁵. This type of arrangement 
appears to require certainty of tangible success, but it may also include a type of bene- 
fit implicitly recognized by the flexibility in the IRS-Census arrangement. That is, to 
re-assure researchers that a fall from the “high wire” of Title 13, Chapter 5, predomi- 
nant purpose attempted by ambitious projects would not necessarily be “fatal”, IRS 
and Census agreed that a safety net of sorts would exist for all projects, especially 
those that failed to meet the criteria in their proposals but made a demonstrably good 
faith effort to do so. However, the good faith effort of failure needed to be docu- 
mented, as did that of success, so that the future proposal process could use these 
outcomes as a learning device for both reviewers and prospective researchers. 



4 Recognition of the Deteriorating Tradeoff 

In the late 1990’s Census became concerned about potential confidentiality problems 
in a previously released SIPP public-use file. These had been detected through analytical 
techniques used by a professional intruder whom Census had engaged contractually 
for just such a purpose. At the January 2002 conference at which the book 
Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for 
Statistical Agencies was showcased and released by Census, Sweeney (2001)⁶ presented 
some of her methods and how they might be used to re-identify survey respondents. 
Part of this methodology relied upon the possibility that variables on the public 
use file might also be individually identifiable in other publicly available datasets. In 
some respects at least, this event served as a type of catalyst for not only the current 
synthetic data approach for the SIPP/SSA/IRS public use file, but also for re- 
examining disclosure risk in the Federal statistical community. 

Although the success of the new Census-IRS relationship was largely predicated 
on a more collegial attitude, it was clear at the outset that this could not be a co-equal 
partnership, as confidential data flowed only from the administrative data provider, 
IRS, to the data user, Census, and not vice versa. However, benefits did accrue. 
Partly as a result of the Sweeney (2001) work, IRS’ own Statistics of Income Division 
decided to subject its public-use file, the tax model file based upon a sample of indi- 
vidual tax return filings, to such an examination and contracted with Sweeney’s labo- 
ratory at Carnegie Mellon University for a professional intruder assessment of its 
confidentiality protections. In addition, because IRS approval of the synthetic data 



⁵ P. 7, Working Paper No. 11, Contexts for the Development of a Data Access and 
Confidentiality Protocol for UK National Statistics, Joint ECE/Eurostat Work Session on 
Statistical Data Confidentiality, Luxembourg, 7-9 April 2003. 

⁶ Latanya Sweeney, “Information Explosion,” in Confidentiality, Disclosure, and Data Access: 
Theory and Practical Applications for Statistical Agencies, North Holland, 2001. 
SIPP/SSA/IRS public-use file would be required (just as the Census RDC proposals 
required specific approvals) before its public release, IRS was also brought in by 
Census early in the process as a collaborator, not just a reviewer. If the synthetic data 
approach is successful at Census, it will help increase the utility to researchers of non- 
confidential tax data at the same time it reduces the need for access to confidential tax 
data, possibly even at Census RDCs where the beta testing will occur. Such a win-win 
outcome would benefit not only the confidentiality protection of administrative 
tax data but also the utility of researcher analysis for decision makers in both govern- 
ment and the private sector. 



5 The Creation of a New Program 

In late 2000, as both agencies began to resolve their differences with work on the 
Criteria Agreement, another Census-IRS crisis was brewing. Namely, a Census re- 
quest to amend the Income Tax Regulations had been submitted in order to enhance 
Census estimates of poverty and income for the SIPP program. The detailed earnings 
items sought were also deemed critical for an emerging Census flagship program, the 
Longitudinal Employer-Household Dynamics study, which sought, among other 
goals, to track more closely employment flows in the U.S. economy. Both requests 
initially encountered opposition, but the justification for each emphasized the minimal 
need for FTI in these mandated uses. Eventually, the regulations were approved in 
February 2001, and immediately after work began on the SIPP/SSA/IRS PUF. It is 
ironic, but not coincidental, that the regulations were approved so soon after the Crite- 
ria Agreement’s implementation in September 2000. That is, the process which had 
prepared both agencies for the Criteria Agreement, also galvanized them for purposes 
of these new proposed uses of FTI by making them focus on the criteria within the 
agreement as well as the protocols and process which would govern such access. It is 
also not a coincidence that one of the goals set forth in the Census justifications for 
the IRS regulations amendment, was the production of a SIPP public-use file, which 
was to include associated administrative data from SSA and IRS. The utility of this 
product was clearly seen as not only a predominant Title 13, Chapter 5, benefit for 
Census, but also a confidentiality boon for administrative tax data in general. How- 
ever, without the items requested for regulation amendment, both SIPP and the poten- 
tial robustness of the proposed LEHD program would have been seriously weakened. 
In fact, had the regulations items not been approved, it is likely that the LEHD program 
as it is known today would not exist. Had the Criteria Agreement, and even its 
early implementation, not been developed as the SIPP and LEHD requests were prepared 
and later considered, it is possible, if not probable, that neither would have been 
approved. 



6 Lessons and Recommendations 

One consequence of the modern Census-IRS relationship is that the Criteria Agree- 
ment process undergone to protect confidentiality also laid the groundwork for further 
legitimate access meeting these requirements; for example, the SIPP/SSA/IRS public- 
use file and the LEHD program described above. 
Another lesson is that the record can probably be satisfied for posterity’s perceptions 
of the past by ensuring that clear and sufficient documentation exists to explain 
those past intentions and actions. 

The final lesson learned is that agencies must look outside themselves for the tal- 
ents and skillsets needed to help them protect confidentiality and meet the needs for 
which confidential data are collected in the first place. In a time of dwindling budgets 
and competing priorities, such considerations are no longer options - they are impera- 
tives. 

In sum, one of the most important services that government agencies can perform 
is communicating to decision makers the need to learn the lessons above. If avenues 
are closed to such pursuits in the future, decision makers need to understand not only 
that their decisions will be based upon inadequate information - including its quality 
- but also that the imprimatur for intruding on the privacy of respondents-citizens will 
not exist. That is, the mandate for data collection will cease, but so will the ability of 
decision makers to lead and govern. 
Abstract. In an on-line statistical database, the query-answering system should 
prevent answers to statistical queries from leading to disclosure of confidential 
data. On the other hand, a statistical user is inclined to data mining, that is, to 
disclose pieces of information that are implicit in the (explicit) answers to his 
queries. A key task for both is to find data that is derivable from given summary 
statistics. We show that this task is easy if data is additive and the set of given 
summary statistics can be modelled by a graph. 



1 Introduction 

An on-line statistical database [1] is an ordinary database which contains information 
on individuals (persons, companies, organizations, et cetera) but its users are allowed 
to only access summary statistics over categories of individuals. For example, con- 
sider a bank database which contains a file called DEPOSITOR whose records have 
the following fields: Name, Account, Gender, Age, Balance. The statistical 
users can ask for summary statistics on Balance over arbitrary categories of depositors, 
but the categories can be specified by logical formulae involving the fields Gender 
and Age, but not the private fields Name and Account. Typically, such summary 
statistics are obtained using the five aggregation functions: sum, count, max, 
min, average. If f is any of these aggregation functions, the following are three possible 
instances of a statistical query expressed in an SQL-like language: 

    Q    select f(Balance)
         from DEPOSITOR
         where Gender = Male and Age > 25

    Q’   select f(Balance)
         from DEPOSITOR
         where Gender = Female and Age > 25

    Q”   select Gender, f(Balance)
         from DEPOSITOR
         where Age > 25
         groupBy Gender



J. Domingo-Ferrer and V. Torra (Eds.): PSD 2004, LNCS 3050, pp. 353-365, 2004. 
It should be noted that Q” is equivalent to the couple {Q, Q’}; therefore, without 
loss of generality, we can limit our considerations to statistical queries from which 
the groupBy clause is missing. 
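
The equivalence between Q” and the couple {Q, Q’} can be checked on a small example. The following Python sketch is only illustrative: the records, the helper names `query` and `group_by_gender`, and the choice of sum as the aggregation function f are assumptions made here, not part of the original text.

```python
# Hypothetical DEPOSITOR file; all names and values are invented for illustration.
DEPOSITOR = [
    {"Name": "A", "Account": 1, "Gender": "Male",   "Age": 30, "Balance": 100},
    {"Name": "B", "Account": 2, "Gender": "Male",   "Age": 20, "Balance": 50},
    {"Name": "C", "Account": 3, "Gender": "Female", "Age": 40, "Balance": 200},
    {"Name": "D", "Account": 4, "Gender": "Female", "Age": 28, "Balance": 75},
    {"Name": "E", "Account": 5, "Gender": "Male",   "Age": 45, "Balance": 300},
]


def query(f, records, predicate):
    """Apply aggregation function f to Balance over the records selected by predicate."""
    return f([r["Balance"] for r in records if predicate(r)])


def group_by_gender(f, records, predicate):
    """Apply f to Balance separately for each Gender among the selected records."""
    groups = {}
    for r in records:
        if predicate(r):
            groups.setdefault(r["Gender"], []).append(r["Balance"])
    return {gender: f(values) for gender, values in groups.items()}


# Q and Q’ with f = sum (an assumed choice of aggregation function).
q = query(sum, DEPOSITOR, lambda r: r["Gender"] == "Male" and r["Age"] > 25)
q_prime = query(sum, DEPOSITOR, lambda r: r["Gender"] == "Female" and r["Age"] > 25)

# Q” groups the same selection (Age > 25) by Gender.
q_double_prime = group_by_gender(sum, DEPOSITOR, lambda r: r["Age"] > 25)

# The grouped answer carries exactly the information of the pair {Q, Q’}.
# Note: a gender with no selected record is simply absent from the grouped answer.
assert q_double_prime == {"Male": q, "Female": q_prime}
```

In this toy file both genders have at least one depositor older than 25, so the grouped answer and the pair of separate answers coincide entry by entry.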

According to the terminology introduced in [2], [3] the aggregation functions sum 
and count are called additive, the aggregation functions min and max are called 
semiadditive, and the aggregation function average is called computed. In this paper, 
as in [2], [3] we focus on the special class of additive aggregation functions that take 
on their values from a commutative group (e.g., the set of reals or the set of integers). 
In other words, we only consider sum-queries, which in our bank database are the 
statistical queries of the type sum(Balance). By the value of such a sum-query we 
mean exactly the total sum of the values of Balance reported in the records of the 
file DEPOSITOR that fall in the category specified by the logical formula contained 
in the where clause. Answering such sum-queries (and, more generally, statistical
queries) raises concerns about the compromise of individual privacy, so confidential
data should be protected. We call a sum-query intrusive if it asks for the total of
a sensitive statistic [5], [11], [12]. Let Q be a sum-query of the type sum(Balance)
on our bank database. If Balance is a confidential field and the value of Q is sensi- 
tive (e.g., according to the threshold criterion), then Q is intrusive. When an intrusive 
sum-query is asked, the query-answering system (QAS) should issue a “non- 
informative” response (see below). The statistical security of a database can also be 
attacked by a nonintrusive sum-query. In our bank database, this is the case if Q is not
intrusive, but its value combined with the responses to previously answered sum-queries
of the type sum(Balance) can lead to the disclosure of the total balance for
some sensitive category of depositors. Then, we call Q tricky and the QAS should
answer Q as if Q were intrusive. Finally, if a sum-query is neither intrusive nor tricky,
the QAS can safely answer it by releasing its value. The situation can be depicted
as a competitive game played by the QAS, which has as its opponent a hypothetical 
user, henceforth referred to as the data miner, who at all times is well-informed of all 
answered sum-queries and is able to identify and compute the data that are derivable 
from their responses, that is, those data that are implicitly released. To beat the data 
miner, the QAS should control the amount of information released each time a new 
query is answered by auditing the whole set of answered sum-queries. More pre- 
cisely, given a new sum-query Q, the auditing procedure should first find those data 
that are derivable from the responses to Q and to previously answered sum-queries; 
next, if no derivable data is sensitive, then the QAS can safely answer Q, otherwise in 
response to Q the QAS will issue a “non-informative” answer, e.g., the set of feasible 
values of Q consistent with the values of previously answered sum-queries. In order 
to find the data that are derivable from the values of a given set of sum-queries, one 
can exploit the additivity of the aggregation function sum and model the amount of 
information conveyed by their answers with a set of linear equations with a 0-1 coefficient
matrix [10], [8]. In this paper, we address the derivability problem under the
assumption that the coefficient matrix is the incidence matrix of a graph, which we call the
query map. This assumption corresponds to a query-overlap restriction, which is
needed to make the auditing procedure feasible [10], [8] but is powerful enough to
deal with two-dimensional tables with arbitrary sets of suppressed cells [9]. We shall
show that, for sum-queries whose values belong to a commutative group, the derivability
problem can be solved in linear time thanks to a simple characterization of the
so-called invariant edges of the query map.

Example 1. Consider again the file called DEPOSITOR, where Balance is a field of 
real type, and the value-set of the field Age consists of the following three intervals: 
Age < 25, 25 < Age < 45, Age > 45. We assume that Balance is a confidential 
field and that 

Q select sum(Balance) 
from DEPOSITOR 

where Gender = Male and Age <25 

is the only intrusive sum-query. Consider the four sum-queries 

Q1   select sum(Balance)
     from DEPOSITOR
     where Gender = Male and Age < 45

Q2   select sum(Balance)
     from DEPOSITOR
     where Age < 25 or Gender = Male and Age > 45

Q3   select sum(Balance)
     from DEPOSITOR
     where Age > 45 or Gender = Male and 25 < Age < 45

Q4   select sum(Balance)
     from DEPOSITOR
     where Gender = Female and Age < 45

and assume that the values of Q1, Q2, Q3 and Q4 are 24, 29, 18 and 12, respectively.
If the values of the four sum-queries are all released, the amount of information
conveyed by their answers is modelled by the following equation system

$$
\begin{aligned}
x_1 + x_2 &= 24 \\
x_1 + x_3 + x_4 &= 29 \\
x_2 + x_3 + x_6 &= 18 \\
x_4 + x_5 &= 12
\end{aligned}
\qquad (1)
$$

where the variables $x_1$, $x_2$, $x_3$, $x_4$, $x_5$ and $x_6$ stand for the total balances of the
depositors belonging to the categories specified by the following six atomic formulae:

C1 = (Gender = Male and Age < 25)

C2 = (Gender = Male and 25 < Age < 45)

C3 = (Gender = Male and Age > 45)

C4 = (Gender = Female and Age < 25)

C5 = (Gender = Female and 25 < Age < 45)

C6 = (Gender = Female and Age > 45)

The coefficient matrix of equation system (1) is the incidence matrix of a graph 
and the corresponding query map is shown in Figure 1.
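
As a rough illustration of how a query map can be read off a coefficient matrix, the sketch below treats each row as a vertex (a query) and each column as an edge (a category total). The function name `query_map` is hypothetical, and the convention that a column with a single 1 becomes a loop is an assumption made here; the matrix rows are read off the where clauses of Q1, ..., Q4 and the categories C1, ..., C6.

```python
def query_map(matrix):
    """Return one edge per column of a 0-1 matrix, or None if it is not an incidence matrix.

    Rows are queries (vertices, numbered from 1); columns are variables (edges).
    A column with two 1s is an ordinary edge; a column with a single 1 is taken
    to be a loop (an assumption of this sketch).
    """
    edges = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        if any(entry not in (0, 1) for entry in column):
            return None
        ends = [i + 1 for i, entry in enumerate(column) if entry == 1]
        if len(ends) == 2:
            edges.append((ends[0], ends[1]))
        elif len(ends) == 1:
            edges.append((ends[0], ends[0]))  # loop
        else:
            return None  # three or more queries overlap on this category
    return edges

# Columns correspond to the category totals x1..x6 of Example 1.
system_1 = [
    [1, 1, 0, 0, 0, 0],  # Q1 covers C1, C2
    [1, 0, 1, 1, 0, 0],  # Q2 covers C1, C3, C4
    [0, 1, 1, 0, 0, 1],  # Q3 covers C2, C3, C6
    [0, 0, 0, 1, 1, 0],  # Q4 covers C4, C5
]
edges_1 = query_map(system_1)
print(edges_1)
```

Under these assumptions the map has four ordinary edges and two loops, one at vertex 4 and one at vertex 3.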




Note that the value of the intrusive sum-query Q is represented by $x_1$. Since the general
solution of equation system (1) is

$(x_1 = a,\ x_2 = 24 - a,\ x_3 = 29 - a - b,\ x_4 = b,\ x_5 = 12 - b,\ x_6 = -35 + 2a + b)$

where a and b are two arbitrary real numbers, the value of $x_1$ is not determined and,
hence, the value of Q is protected. Suppose now that a new sum-query arrives:

Q5   select sum(Balance)
     from DEPOSITOR
     where Gender = Female and Age > 25

and assume that the value of Q5 is 7. If the QAS answers Q5, then the amount of
information conveyed by the answers to Q1, ..., Q5 is obtained by adding the equation
$x_5 + x_6 = 7$ to equation system (1) and the corresponding query map is shown in
Figure 2.

The general solution is now

$(x_1 = 15,\ x_2 = 9,\ x_3 = 14 - b,\ x_4 = b,\ x_5 = 12 - b,\ x_6 = -5 + b)$

so that the value of Q is disclosed. Therefore, the QAS should not answer Q5 by releasing
its value, but should issue the set of feasible values of Q5 consistent with the
answers to Q1, ..., Q4, that is, the whole set of real numbers.
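
The disclosure check of Example 1 can be reproduced with exact arithmetic. The sketch below is a generic Gauss-Jordan elimination over the rationals, not the graph-based procedure developed in this paper; the function name `determined_variables` is hypothetical, the system is assumed to be consistent, and the matrix rows are read off the where clauses of Q1, ..., Q5.

```python
from fractions import Fraction

def determined_variables(matrix, rhs):
    """Return {column index: value} for the variables fixed by a consistent system."""
    rows = [[Fraction(c) for c in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    n_cols = len(matrix[0])
    pivots = []  # (row index, column index) of each pivot
    r = 0
    for c in range(n_cols):
        if r == len(rows):
            break
        p = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if p is None:
            continue
        rows[r], rows[p] = rows[p], rows[r]
        rows[r] = [v / rows[r][c] for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [vi - factor * vr for vi, vr in zip(rows[i], rows[r])]
        pivots.append((r, c))
        r += 1
    pivot_cols = {c for _, c in pivots}
    free_cols = [c for c in range(n_cols) if c not in pivot_cols]
    # A pivot variable is determined exactly when its reduced row
    # involves no free variable.
    return {c: rows[i][-1] for i, c in pivots
            if all(rows[i][f] == 0 for f in free_cols)}

four_queries = [
    [1, 1, 0, 0, 0, 0],  # Q1 covers C1, C2
    [1, 0, 1, 1, 0, 0],  # Q2 covers C1, C3, C4
    [0, 1, 1, 0, 0, 1],  # Q3 covers C2, C3, C6
    [0, 0, 0, 1, 1, 0],  # Q4 covers C4, C5
]
five_queries = four_queries + [[0, 0, 0, 0, 1, 1]]  # Q5 covers C5, C6

before = determined_variables(four_queries, [24, 29, 18, 12])
after = determined_variables(five_queries, [24, 29, 18, 12, 7])
print(before, after)
```

Column index 0 stands for the total of category C1, so its appearance in `after` corresponds to the disclosure of the value of Q.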

Let us now state the derivability problem in formal terms. Let A be an (additive)
commutative group with zero 0, and let G be a graph (with no isolated vertices)
where loops are allowed. Without loss of generality, we assume that G is connected.
A vertex labeling (an edge labeling, respectively) of G is a mapping from V(G)
(E(G), respectively) to A. Given a vertex labeling q of G, an edge labeling x of G is
compatible with q if x is a solution of the equation system

$\sum_{e \in E(v)} x(e) = q(v) \qquad (v \in V(G))$

where E(v) denotes the set of edges of G incident to v. A vertex labeling of G is admissible
if there is an edge labeling compatible with it. For example, the null vertex
labeling (that is, the vertex labeling being zero everywhere) is admissible. Given an
admissible vertex labeling q of G, we call the vertex-weighted graph (G, q) a map. An
edge e of G is an A-invariant edge of the map (G, q) if x(e) = x’(e) for every two edge
labelings x and x’ compatible with q. Let X(q) be the set of all edge labelings compatible
with q. If 0 is the null vertex labeling, then it is easily seen that X(q) is a translation
of X(0), that is, X(q) = x + X(0), where x is any edge labeling of G compatible
with q. Therefore, the set of A-invariant edges of the map (G, q) is the set of edges e
such that y(e) = 0 for all y in X(0) and, hence, it is the same for every map (G, q).
Accordingly, the reference to q can be omitted and such edges will be referred to as
the A-invariant edges of G.
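
The definitions of compatibility and of the translation X(q) = x + X(0) can be checked on the data of Example 1. In this sketch the function names, the edge order and the two chosen values of the free parameter b are assumptions; the edge list is the query map obtained after Q5 is added, with one edge per category total.

```python
def vertex_sums(vertices, edges, x):
    """Sum of x over the edges incident to each vertex (a loop is counted once)."""
    total = {v: 0 for v in vertices}
    for (u, w), value in zip(edges, x):
        total[u] += value
        if w != u:
            total[w] += value
    return total

def is_compatible(vertices, edges, q, x):
    """True if the edge labeling x is compatible with the vertex labeling q."""
    return vertex_sums(vertices, edges, x) == q

vertices = [1, 2, 3, 4, 5]                                 # queries Q1..Q5
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (4, 5), (3, 5)]   # totals x1..x6
q = {1: 24, 2: 29, 3: 18, 4: 12, 5: 7}

def solution(b):
    """General solution of Example 1 after Q5 is answered, for a given b."""
    return [15, 9, 14 - b, b, 12 - b, -5 + b]

x, x_prime = solution(0), solution(5)
y = [s - t for s, t in zip(x, x_prime)]  # difference of two compatible labelings

print(is_compatible(vertices, edges, q, x), is_compatible(vertices, edges, q, x_prime))
print(vertex_sums(vertices, edges, y))   # all zero: y is compatible with the null labeling
print(y[0], y[1])                        # both zero: the first two edges are invariant
```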

The problem is to find the set of all A-invariant edges of G and to compute the
value of each of them given a map (G, q). This problem was solved in linear time in
the following two cases:

- A = Z (the set of integers) and G is bipartite [7]: the Z-invariant edges are exactly
the bridges of G;

- A = R (the set of reals) and G is arbitrary [9]: the R-invariant edges are exactly the
edges of G whose removal increases the number of bipartite components
of G.

The above result for A = R was achieved using matroid-theoretic arguments,
which do not carry over to an arbitrary commutative group A. In this paper, using some
results of the theory of magic graphs [6], we show that the set of A-invariant edges of
an arbitrary graph and the value of each of them can be found in linear time.

2 Background

Let A be a commutative group with zero 0. If a is an element of A, by 2a we denote
the sum a + a. An element b of A is even if there is an element a of A such that b =
2a. If b is even, by half(b) we denote the set $\{a \in A : 2a = b\}$. Accordingly, half(0) =
$\{a \in A : 2a = 0\} = \{a \in A : a = -a\}$. For example, if A is the set {0, 1, ..., p-1} with
integer addition mod p, then half(0) = {0, k} if p is even, say p = 2k; otherwise,
half(0) = {0}. The following result is borrowed from the theory of magic
graphs [6], where only loopless graphs are considered.

Proposition 1 Let G be a connected, loopless graph and let q be a vertex labeling
of G.

(i) If G is bipartite, then q is admissible if and only if $\sum_{v \in U} q(v) = \sum_{v \in W} q(v)$,
where {U, W} is the bipartition of V(G); otherwise, q is admissible if and only if the
sum $\sum_{v \in V(G)} q(v)$ is even.

(ii) If q is admissible, then an edge labeling compatible with q can be found in linear
time using the following algorithm, where by a leaf of a graph we mean a vertex
with exactly one incident edge that is not a loop.

Algorithm 1

(1) Find a spanning tree T of G.

(2) If G is not bipartite, find an edge e* = (u*, v*) whose addition to T creates
    an odd cycle, and set T := T + e*.

(3) For each edge $e \in E(G)-E(T)$, set x(e) := 0.

(4) Until T contains no leaves, repeat:
    Find a leaf u of T. Let e be the edge incident to u and let w be the
    other end-point of e. Set x(e) := q(u), q(w) := q(w)-q(u), and delete
    u and e from T.

(5) If $E(T) = \emptyset$ (that is, if G is bipartite), then Exit. Otherwise, let {U, W} be
    the bipartition of the vertex set of the tree T-e* with U containing the end-points
    u* and v* of e*. Set $b := \sum_{v \in U} q(v) - \sum_{v \in W} q(v)$. Choose an element
    a of half(b) and set x(e*) := a, q(u*) := q(u*)-a and q(v*) := q(v*)-a. Delete
    e* from T.

(6) Until T is a one-point graph, repeat:
    Find a leaf u of T. Let e be the edge incident to u and let w be the
    other end-point of e. Set x(e) := q(u), q(w) := q(w)-q(u), and delete
    u and e from T.

(In Appendix I Algorithm 1 is applied to the map shown in Figure 2.)
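
The six steps of Algorithm 1 can be sketched as follows for the special case A = Z, where half(b) is either empty or the single element b/2. The function name `algorithm_1`, the 0-based vertex numbering, the use of a breadth-first spanning tree and the choice to return None for a non-admissible labeling are all assumptions of this sketch; the graph is assumed to be connected and loopless.

```python
from collections import deque

def algorithm_1(n, edges, q):
    """Edge labeling over Z compatible with q, or None if q is not admissible."""
    q = list(q)
    adj = [[] for _ in range(n)]
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))
    # Step (1): BFS spanning tree; the BFS levels 2-colour the tree.
    color = [None] * n
    color[0] = 0
    tree, queue = set(), deque([0])
    while queue:
        u = queue.popleft()
        for v, i in adj[u]:
            if color[v] is None:
                color[v] = 1 - color[u]
                tree.add(i)
                queue.append(v)
    # Step (2): a non-tree edge joining equally coloured vertices closes an odd cycle.
    star = next((i for i, (u, v) in enumerate(edges)
                 if i not in tree and color[u] == color[v]), None)
    x = [0] * len(edges)  # Step (3): edges outside T keep the value 0.
    inc = [set() for _ in range(n)]  # edges of T still incident to each vertex
    for i in set(tree) | ({star} if star is not None else set()):
        u, v = edges[i]
        inc[u].add(i)
        inc[v].add(i)
    removed = set()

    def peel():
        # Steps (4) and (6): repeatedly delete a leaf together with its edge.
        leaves = [v for v in range(n) if len(inc[v]) == 1]
        while leaves:
            u = leaves.pop()
            if len(inc[u]) != 1:
                continue
            i = inc[u].pop()
            s, t = edges[i]
            w = t if s == u else s
            x[i] = q[u]
            q[w] -= q[u]
            inc[w].discard(i)
            removed.add(u)
            if len(inc[w]) == 1:
                leaves.append(w)

    peel()
    if star is not None:
        # Step (5): what is left of T is the odd cycle through e*.
        u_star, v_star = edges[star]
        cycle = [v for v in range(n) if inc[v]]
        b = sum(q[v] if color[v] == color[u_star] else -q[v] for v in cycle)
        if b % 2:
            return None  # b is not even in Z, so q is not admissible
        a = b // 2
        x[star] = a
        q[u_star] -= a
        q[v_star] -= a
        inc[u_star].discard(star)
        inc[v_star].discard(star)
        peel()
    (root,) = set(range(n)) - removed
    return x if q[root] == 0 else None

# Query map after Q5 in Example 1, with vertices renumbered 0..4.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (3, 4), (2, 4)]
x = algorithm_1(5, edges, [24, 29, 18, 12, 7])
print(x)
```

Whatever spanning tree is chosen, the first two entries of the result should be 15 and 9, in agreement with the general solution given in Example 1.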

Consider now a connected graph G containing at least one loop and let q be any vertex
labeling of G. An edge labeling compatible with q can always be found using the
algorithm (henceforth referred to as Algorithm 2) obtained from Algorithm 1 by replacing
step (2) by the step

(2) Find a loop e*, and set T := T + e*.

and steps (5) and (6) by the single step

(5) If v* is the end-point of e*, set x(e*) := q(v*).

(In Appendix II Algorithm 2 is applied to the information model shown in Figure 1.)
So, one has

Proposition 2 Every vertex labeling q of a graph containing loops is admissible and
an edge labeling compatible with q can be found in linear time.
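
A minimal sketch of Algorithm 2 is given below, using Python integers as the group. The function name `algorithm_2`, the 0-based vertex numbering and the trick of peeling leaves by visiting the vertices in reverse breadth-first order from the loop vertex are assumptions of this sketch; a loop is written (v, v) and is assumed to count once at v. The sample graph is the query map of Example 1 before Q5, with loops for the two categories covered by a single query.

```python
from collections import deque

def algorithm_2(n, edges, q):
    """Edge labeling compatible with q on a connected graph with at least one loop."""
    q = list(q)
    # Step (2): pick a loop e*; its end-point v* becomes the root of the tree.
    star = next(i for i, (u, v) in enumerate(edges) if u == v)
    root = edges[star][0]
    adj = [[] for _ in range(n)]
    for i, (u, v) in enumerate(edges):
        if u != v:
            adj[u].append((v, i))
            adj[v].append((u, i))
    # Step (1): BFS spanning tree rooted at v*.
    parent = {root: None}  # vertex -> (parent vertex, tree edge index)
    order, queue = [], deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w, i in adj[u]:
            if w not in parent:
                parent[w] = (u, i)
                queue.append(w)
    x = [0] * len(edges)  # Step (3): edges outside T keep the value 0.
    # Step (4): in reverse BFS order every vertex is a leaf of what remains.
    for u in reversed(order[1:]):
        w, i = parent[u]
        x[i] = q[u]
        q[w] -= q[u]
    x[star] = q[root]  # Step (5)
    return x

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (3, 3), (2, 2)]
q = [24, 29, 18, 12]
x = algorithm_2(4, edges, q)
print(x)
```

No admissibility test is needed here, which is the content of Proposition 2: the loop absorbs whatever residual value reaches its end-point.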



3 Characterization of Invariant Edges 

Let G be a connected graph and let Y = X(0). An edge labeling in Y will be called a
circulation in G over A (an A-circulation, for short); moreover, if y is an A-circulation
in G, the edge set $\{e \in E(G) : y(e) \neq 0\}$ is called the support of y. Bearing
in mind that an edge e of G is A-invariant if and only if y(e) = 0 for all y in Y, we
have that an edge of G is A-invariant if and only if it does not belong to the support
of any A-circulation. Let us distinguish the following three cases: G is bipartite, G is
not bipartite and is loopless, G contains loops.

Case 1. G is bipartite. If G is a tree then Y = {0} (see Algorithm 1) so that each
edge of G is A-invariant. Assume that G is not a tree. For every cycle C, no edge in C
is A-invariant since, for an arbitrarily chosen nonzero element a of A, one can construct
an A-circulation (see Figure 3) whose support is C.




Fig. 3. An A-circulation associated with an even cycle. 

Therefore, every A-invariant edge of G is a bridge. On the other hand, if e is a
bridge of G and G’ is either component of G-e, then

$y(e) = \left[\sum_{v \in U} \sum_{e' \in E(v)} y(e')\right] - \left[\sum_{v \in W} \sum_{e' \in E(v)} y(e')\right] = 0$

where {U, W} is the bipartition of V(G’) and e is incident to U. To sum up, the
A-invariant edges of G are exactly the bridges of G.
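
The circulation used in Case 1 can be illustrated by a short sketch. It assumes that the labeling associated with an even cycle alternates a and -a along the cycle, which is the natural construction but is not spelled out in the text; the function names are hypothetical and Python integers stand in for the group A.

```python
def even_cycle_circulation(cycle_length, a):
    """Labels +a, -a, +a, ... around a cycle with an even number of edges."""
    if cycle_length % 2:
        raise ValueError("the alternating labeling needs an even cycle")
    return [a if i % 2 == 0 else -a for i in range(cycle_length)]

def vertex_sums_on_cycle(labels):
    """Sum of the two edge labels meeting at each vertex of the cycle.

    Edge i joins vertex i and vertex i+1 (mod k), so vertex i sees edges i-1 and i.
    """
    k = len(labels)
    return [labels[i - 1] + labels[i] for i in range(k)]

y = even_cycle_circulation(4, 7)
print(y, vertex_sums_on_cycle(y))

# The same pattern on an odd cycle leaves a residual 2a at one vertex.
print(vertex_sums_on_cycle([7, -7, 7]))
```

The residual 2a on the odd cycle is what makes the set half(0) relevant in the non-bipartite case.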

Case 2. G is not bipartite and is loopless. Let T be a spanning tree of G with the
addition of an edge e* (see Algorithm 1) creating an odd cycle, say C. Given an arbitrary
element a of half(0), with C we can associate an A-circulation (see Figure 4),
whose support is empty or C depending on whether a = 0 or $a \neq 0$, respectively.




Fig. 4. An A-circulation associated with an odd cycle. 

Let $c_a$ be such an A-circulation associated with C. If G = T, then $Y = \{c_a : a \in \mathrm{half}(0)\}$
(see Algorithm 1) so that, if half(0) = {0}, then each edge of G is A-invariant;
otherwise (that is, if $\mathrm{half}(0) \neq \{0\}$), an edge is A-invariant if and only if it
is a bridge. Let us now assume that $G \neq T$ and let $E(G)-E(T) = \{e_1, \ldots, e_k\}$. The addition
of $e_i$ to T creates a closed even walk $C_i$ which is either an even cycle or an L-odd set
[4], that is, a pair of edge-disjoint odd cycles joined by a (possibly one-point) path.
Given an arbitrary element a of A, with $C_i$ we can associate an A-circulation as follows.
If $C_i$ is an even cycle, then the A-circulation is of the form shown in Fig. 3; if
$C_i$ is an L-odd set, then the A-circulation is of the form shown in Fig. 5.




Fig. 5. An A-circulation associated with an L-odd set. 

Let $c_{i,a_i}$ be such an A-circulation associated with $C_i$ for some element $a_i$ of A. As
proven in [6], every A-circulation y in G can be written as

$y = c_a + \sum_{i=1,\ldots,k} c_{i,a_i}$

for some element a of half(0) and some elements $a_1, \ldots, a_k$ of A. Let us distinguish
two subcases depending on whether half(0) = {0} or $\mathrm{half}(0) \neq \{0\}$.

Case 2(i): half(0) = {0}. Then, an edge e is A-invariant if and only if e does not
belong to any even cycle or to any L-odd set, that is, if and only if either e is a
bridge and G-e has a bipartite component or e belongs to all odd cycles of G. Note
that in both cases, e is characterized by the property that G-e has one more bipartite
component than G. As an example, the R-invariant edges of the graph of Figure 2 are
the edges (1,2) and (1,3).

Case 2(ii): $\mathrm{half}(0) \neq \{0\}$. Then every A-invariant edge is a bridge, since it can belong
to no cycle. Furthermore, if $\mathrm{half}(0) \neq A$ then an edge e is A-invariant if and
only if e is a bridge and G-e has a bipartite component; otherwise (that is, if half(0) =
A) 2a = 0 for all a, so that an edge is A-invariant if and only if it is a bridge.
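
The role of half(0) in Case 2 can be observed on the smallest non-bipartite loopless graph, a triangle, by enumerating all circulations over the cyclic group Z_m. This is a brute-force sketch meant only for tiny graphs, not the linear-time method of the paper; the function names and the 0-based vertex numbering are assumptions.

```python
from itertools import product

def circulations(n, edges, m):
    """All labelings of the edges by elements of Z_m whose vertex sums are all 0."""
    result = []
    for y in product(range(m), repeat=len(edges)):
        sums = [0] * n
        for (u, v), value in zip(edges, y):
            sums[u] += value
            if v != u:
                sums[v] += value
        if all(s % m == 0 for s in sums):
            result.append(y)
    return result

def invariant_edges(n, edges, m):
    """Indices of the edges that are 0 in every Z_m-circulation."""
    ys = circulations(n, edges, m)
    return [i for i in range(len(edges)) if all(y[i] == 0 for y in ys)]

triangle = [(0, 1), (1, 2), (0, 2)]

# half(0) = {0} in Z_3: only the null circulation, every edge is invariant.
print(circulations(3, triangle, 3), invariant_edges(3, triangle, 3))
# half(0) = {0, 2} in Z_4: a nonzero circulation appears, no edge is invariant.
print(circulations(3, triangle, 4), invariant_edges(3, triangle, 4))
```

Since a triangle has no bridges, the empty result over Z_4 agrees with the statement that, when half(0) differs from {0}, only bridges can be invariant.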

Case 3. G contains loops. Let T be a spanning tree of G with the addition of a loop
e* (see Algorithm 2). If G = T then Y = {0} so that each edge of G is A-invariant.
Otherwise, let $E(G)-E(T) = \{e_1, \ldots, e_k\}$. The addition of $e_i$ to T again creates a closed
even walk $C_i$ which is either an even cycle or an L-odd set having e* as one of its
cycles. Given an arbitrary element a of A, with $C_i$ we can associate an A-circulation
as follows. If $C_i$ is an even cycle, then the A-circulation is of the form shown in Fig.
3; if $C_i$ is an L-odd set, then the A-circulation is of either form shown in Fig. 6.




Fig. 6. An A-circulation associated with an L-odd set containing a loop. 

Let $c_{i,a_i}$ be such an A-circulation associated with $C_i$ for some element $a_i$ of A. It
can be proven that every A-circulation y in G can be written as

$y = \sum_{i=1,\ldots,k} c_{i,a_i}$

for some elements $a_1, \ldots, a_k$ of A. Let us distinguish two subcases depending on
whether half(0) = {0} or $\mathrm{half}(0) \neq \{0\}$.

Case 3(i): half(0) = {0}. Then, an edge e is A-invariant if and only if e does not
belong to any even cycle or to any L-odd set, that is, if and only if e either is a
bridge and G-e has a bipartite component or e belongs to all odd cycles of G. As an
example, the graph of Figure 1 has no R-invariant edges.
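
For the case half(0) = {0}, the two examples quoted in the text can be checked with the criterion "G-e has one more bipartite component than G". The sketch below tests this edge by edge, so it runs in quadratic rather than linear time; the function names are hypothetical, and the two edge lists are the query maps of Example 1 as read off its queries, with loops written (v, v).

```python
def bipartite_components(vertices, edges):
    """Number of connected components that are bipartite (a loop spoils bipartiteness)."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color, count = {}, 0
    for s in vertices:
        if s in color:
            continue
        color[s] = 0
        stack, ok = [s], True
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    ok = False  # odd cycle (or loop) found in this component
        count += ok
    return count

def invariant_edges_half0(vertices, edges):
    """Edges whose removal creates one more bipartite component."""
    base = bipartite_components(vertices, edges)
    return [e for i, e in enumerate(edges)
            if bipartite_components(vertices, edges[:i] + edges[i + 1:]) > base]

figure_1 = [(1, 2), (1, 3), (2, 3), (2, 4), (4, 4), (3, 3)]
figure_2 = [(1, 2), (1, 3), (2, 3), (2, 4), (4, 5), (3, 5)]

print(invariant_edges_half0([1, 2, 3, 4], figure_1))
print(invariant_edges_half0([1, 2, 3, 4, 5], figure_2))
```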

Case 3(ii): $\mathrm{half}(0) \neq \{0\}$. If $\mathrm{half}(0) \neq A$ then an edge e is A-invariant if and only if
either e is a bridge and G-e has a bipartite component or e is a loop and G-e is
loopless; otherwise (that is, if half(0) = A), an edge e is A-invariant if and only if
either e is a bridge and G-e has a loopless component or e is a loop and G-e is
loopless.

To sum up, we have the following. 

Proposition 3 Let G be a connected graph and A a commutative group. If half(0) =
{0}, then an edge e of G is A-invariant if and only if either e is a bridge and G-e has
a bipartite component or e belongs to all odd cycles of G. If $\{0\} \subset \mathrm{half}(0) \subset A$, then
an edge e of G is A-invariant if and only if either e is a bridge and G-e has a bipartite
component or e is a loop and G-e is loopless. If half(0) = A, then an edge e of G is
A-invariant if and only if either e is a bridge and G-e has a loopless component or e
is a loop and G-e is loopless.
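
Which clause of Proposition 3 applies depends only on the set half(0). For the cyclic groups Z_m this can be decided by direct enumeration, as in the following sketch; the function names and the returned strings are hypothetical, and the enumeration is meant for small m only.

```python
def half(b, m):
    """Elements a of Z_m (integers mod m) with 2a = b; empty if b is not even in Z_m."""
    return [a for a in range(m) if (2 * a - b) % m == 0]

def proposition_3_case(m):
    """Clause of Proposition 3 that applies to the group Z_m, for m >= 2."""
    h = half(0, m)
    if h == [0]:
        return "half(0) = {0}"
    if len(h) == m:
        return "half(0) = A"
    return "{0} < half(0) < A"

for m in (2, 5, 6):
    print(m, half(0, m), proposition_3_case(m))
```

For odd m the first clause applies, for m = 2 the last one, and for every other even m the middle one.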

4 Computational Aspects 

A consequence of Proposition 3 is that the set of A-invariant edges of a graph can be
found in linear time since:

- the set of bridges whose removal creates one more bipartite component can be 
found in linear time [9], and the same can be easily proven for the set of bridges 
whose removal creates one more loopless component; 

- the presence of a loop whose removal creates a loopless graph can be checked in 
linear time; 

- the set of edges belonging to all odd cycles can be found in linear time [9]. 

Once the set of A-invariant edges of a graph G has been found, in order to deter- 
mine their values for a map (G, q) one can use Algorithm 1 or Algorithm 2, depend- 
ing on whether G is or is not loopless. 
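
One building block of the linear-time bound is the detection of bridges. The sketch below uses the standard depth-first low-link technique, which is a well-known method and not necessarily the procedure of [9]; the function name `bridges` and the 0-based vertex numbering are assumptions. The further test on the components of G-e (bipartite or loopless) is not included.

```python
import sys

def bridges(n, edges):
    """Indices of the bridges of a graph on vertices 0..n-1 (DFS low-link method)."""
    adj = [[] for _ in range(n)]
    for i, (u, v) in enumerate(edges):
        if u != v:  # a loop is never a bridge
            adj[u].append((v, i))
            adj[v].append((u, i))
    disc = [None] * n  # discovery time of each vertex
    low = [0] * n      # earliest discovery time reachable from the subtree
    found = []
    clock = 0

    def dfs(u, parent_edge):
        nonlocal clock
        disc[u] = low[u] = clock
        clock += 1
        for v, i in adj[u]:
            if i == parent_edge:
                continue  # skip only the tree edge itself, so parallel edges still count
            if disc[v] is None:
                dfs(v, i)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    found.append(i)  # nothing below v reaches u or above
            else:
                low[u] = min(low[u], disc[v])

    sys.setrecursionlimit(max(1000, 2 * n + 100))
    for s in range(n):
        if disc[s] is None:
            dfs(s, None)
    return sorted(found)

# A triangle with a pendant edge: only the pendant edge is a bridge.
print(bridges(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))
```

Each vertex and edge is visited a constant number of times, so the running time is linear in the size of the graph; a recursion-free variant would be preferable for very large inputs.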

5 An Open Problem 

We solved the problem of finding the A-invariant edges of a map and of computing 
their values, where A is an arbitrary commutative group. The case that A is the set of 
non-negative elements of an “ordered” commutative group is open. However, if A is 
the set of non-negative reals, then the solution is known [9] and it also applies to the 
case that A is the set of non-negative integers provided that the underlying graph is 
bipartite [7]. 